[PATCH 00/12] Integrate commit-graph into fsck, gc, and fetch

2018-05-10 Thread Derrick Stolee
The commit-graph feature is not useful to end users until the
commit-graph file is maintained automatically by Git during normal
upkeep operations. One natural place to trigger this write is during
'git gc'.

Before automatically generating a commit-graph, we need to be able to
verify the contents of a commit-graph file. Integrate commit-graph
checks into 'fsck' that check the commit-graph contents against commits
in the object database.

Thanks, Jakub, for the feedback on the RFC. I think there are still some
things to decide at a high-level before we dig too far into the review.
Specifically, the integration points in 'fsck', 'gc', and 'fetch' are
worth considering our alternatives.

For 'fsck', the current integration is minimal: if core.commitGraph is
true, then 'git fsck' calls 'git commit-graph verify --object-dir=X'
for the objects directory and every alternate. There are a few options
to consider here:

1. Keep this behavior: we should always check the commit-graph if it
   exists.

2. Add a --[no-]commit-graph argument to 'fsck' that toggles the
   commit-graph verification.

3. Remove all direct integration between 'fsck' and 'commit-graph' and
   instead rely on users checking 'git commit-graph verify' manually
   when they suspect a problem with the commit-graph file. While this
   option is worth considering, it is my least favorite since it requires
   more from users.

For 'gc' and 'fetch', these seemed like natural places to update the
commit-graph file. Relative to the other maintenance that occurs during
these commands, the 'git commit-graph write' command is fast, especially
for incremental updates; only the "new" commits are walked when computing
generation numbers and other metadata for the commit-graph file.

The behavior in this patch series does the following:

1. Near the end of 'git gc', run 'git commit-graph write'. The location
   of this code assumes that a 'git gc --auto' has not terminated early
   due to not meeting the auto threshold.

2. At the end of 'git fetch', run 'git commit-graph write'. This means
   that every reachable commit will be in the commit-graph after a
   a successful fetch, which seems a reasonable frequency. Then, the
   only times we would be missing a reachable commit is after creating
   one locally. There is a problem with the current patch, though: every
   'git fetch' call runs 'git commit-graph write', even if there were no
   ref updates or objects downloaded. Is there a simple way to detect if
   the fetch was non-trivial?

One obvious problem with this approach: if we compute this during 'gc'
AND 'fetch', there will be times where a 'fetch' calls 'gc' and triggers
two commit-graph writes. If I were to abandon one of these patches, it
would be the 'fetch' integration. A 'git gc' really wants to delete all
references to unreachable commits, and without updating the commit-graph
we may still have commit data in the commit-graph file that is not in
the object database. In fact, deleting commits from the object database
but not from the commit-graph will cause 'git commit-graph verify' to
fail!

I welcome discussion on these ideas, as we are venturing out of the
"pure data structure" world and into the "user experience" world. I am
less confident in my skills in this world, but the feature is worthless
if it does not improve the user experience.

Thanks,
-Stolee

Derrick Stolee (12):

Commits 01-07 focus on the 'git commit-graph verify' subcommand. These
are ready for full, rigorous review.

  commit-graph: add 'verify' subcommand
  commit-graph: verify file header information
  commit-graph: parse commit from chosen graph
  commit-graph: verify fanout and lookup table
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: verify commit contents against odb

Commit 08 integrates 'git commit-graph verify' into fsck. The work here
is the minimum integration possible. (See above discussion of options.)

  fsck: verify commit-graph

Commit 09 introduces a new '--reachable' option only to make the calls
from 'gc' and 'fetch' simpler. Commits 10-11 integrate writing the
commit-graph into 'gc' and 'fetch', respectively. (See above disucssion.)

  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  fetch: compute commit-graph by default

Commit 12 simply deletes sections from the "Future Work" section
of the commit-graph design document.

  commit-graph: update design document

 Documentation/config.txt |  10 ++
 Documentation/git-commit-graph.txt   |  14 ++-
 Documentation/git-fsck.txt   |   3 +
 Documentation/git-gc.txt |   4 +
 Documentation/technical/commit-graph.txt |  22 
 builtin/commit-graph.c   |  79 +-
 builtin/fetch.c  |  13 +++
 builtin/fsck.c   |  21 
 builtin/gc.c 

Re: [PATCH 2/2] replace-object.c: remove the_repository from prepare_replace_object

2018-05-10 Thread Derrick Stolee

On 5/10/2018 7:56 AM, Jeff King wrote:

On Thu, May 10, 2018 at 07:23:13PM +0900, Junio C Hamano wrote:


This one was doing

ptr = xmalloc(sizeof(*another_ptr))

and it was OK because ptr and another_ptr happened to be of the same
type.  I wonder if we are making it safer, or making it more obscure
to seasoned C programmers, if we introduced a pair of helper macros,
perhaps like these:

#define ALLOCATE(ptr) (ptr) = xmalloc(sizeof(*(ptr)))
#define CALLOCATE(ptr,cnt) (ptr) = xcalloc((cnt), sizeof(*(ptr)))

I've often wondered that, too. It's the natural endgame of the
ALLOC_ARRAY() road we've been going down.


I'll chime in that I like this idea.

Because I'm trying to learn more about Coccinelle, I added the diff 
below and ran 'make coccicheck'. It resulted in a 1000-line patch on 
'next'. I'll refrain from sending that patch series, but it's nice to 
know a "simple" transformation is possible.


Using `git grep` I see 230 instances of 'xmalloc' and 261 instances of 
'xcalloc'. After the Coccinelle transformation, these are down to 194 
and 190, respectively, because the rest allocate in the same line as the 
definition. It's worth thinking about the macro pattern for those cases.


diff --git a/contrib/coccinelle/alloc.cocci b/contrib/coccinelle/alloc.cocci
new file mode 100644
index 00..95f426c4dc
--- /dev/null
+++ b/contrib/coccinelle/alloc.cocci
@@ -0,0 +1,13 @@
+@@
+expression p;
+@@
+- p = xmalloc(sizeof(*p));
++ ALLOCATE(p);
+
+@@
+expression c;
+expression p;
+@@
+- p = xcalloc(c, sizeof(*p));
++ CALLOCATE(p,c);
+
diff --git a/git-compat-util.h b/git-compat-util.h
index f9e4c5f9bc..5398f217d7 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -843,6 +843,9 @@ extern FILE *fopen_or_warn(const char *path, const 
char *mode);

  */
 #define FREE_AND_NULL(p) do { free(p); (p) = NULL; } while (0)

+#define ALLOCATE(ptr) (ptr) = xmalloc(sizeof(*(ptr)))
+#define CALLOCATE(ptr,cnt) (ptr) = xcalloc((cnt), sizeof(*(ptr)))
+
 #define ALLOC_ARRAY(x, alloc) (x) = xmalloc(st_mult(sizeof(*(x)), 
(alloc)))
 #define REALLOC_ARRAY(x, alloc) (x) = xrealloc((x), 
st_mult(sizeof(*(x)), (alloc)))




Re: [PATCH 0/1] Fix UX issue with commit-graph.lock

2018-05-10 Thread Derrick Stolee

On 5/10/2018 3:00 AM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


We use the lockfile API to avoid multiple Git processes from writing to
the commit-graph file in the .git/objects/info directory. In some cases,
this directory may not exist, so we check for its existence.

The existing code does the following when acquiring the lock:

1. Try to acquire the lock.
2. If it fails, try to create the .git/object/info directory.
3. Try to acquire the lock, failing if necessary.

The problem is that if the lockfile exists, then the mkdir fails, giving
an error that doesn't help the user:

   "fatal: cannot mkdir .git/objects/info: File exists"

Isn't a better immediate fix to make the second step pay attention
to errno?  If mkdir() failed due to EEXIST, then we know we tried to
aquire the lock already, so we can die with an appropriate message.

That way, we can keep the "optimize for the normal case" that the
approach to assume object/info/ directory is already there, instead
of always checking its existence which is almost always true
beforehand.


This "optimize for the normal case" is why the existing code is 
organized the way it is.


Since this code is only for writing a commit-graph file, this "check the 
directory first" option is a very small portion of the full time to 
write the file, so the "optimization" has very little effect, 
relatively. My personal opinion is to make the code cleaner when the 
performance difference is negligible.


I'm willing to concede this point and use the steps you suggest below, 
if we think this is the best way forward.



Also, can't we tell why we failed to acquire the lock at step #1?
Do we only get a NULL that says "I am not telling you why, but we
failed to lock"?


To tell why we failed to acquire the lock, we could inspect "errno". 
However, this requires whitebox knowledge of both the lockfile API and 
the tempfile API to know that the last system call to set errno was an 
open() or adjust_shared_perm(). To cleanly make decisions based on the 
reason the lock failed to acquire, I think we would need to modify the 
lockfile and tempfile APIs to return a failure reason. This could be 
done by passing an `int *reason`, but the extra noise in these APIs is 
likely not worth the change.




What I am getting at is that the ideal sequence
would be more like:
 1. Try to acquire the lock.
 2-a. if #1 succeeds, we are happy. ignore the rest and return
  the lock.
 2-b. if #1 failed because object/info/ did not exist,
  mkdir() it, and die if we cannot, saying "cannot mkdir".
 if mkdir() succeeds, jump t 3.
 2-c. if #1 failed but that is not due to missing object/info/,
 die saying "cannot lock".
 3. Try to acquire the lock.
 4-a. if #3 succeeds, we are happy.ignore the rest and return
  the lock.
 4-b. die saying "cannot lock".



Thanks,
-Stolee


Re: [PATCH 1/1] commit-graph: fix UX issue when .lock file exists

2018-05-09 Thread Derrick Stolee

On 5/9/2018 10:42 AM, Jeff King wrote:

On Wed, May 09, 2018 at 02:15:38PM +, Derrick Stolee wrote:


The commit-graph file lives in the .git/objects/info directory.
Previously, a failure to acquire the commit-graph.lock file was
assumed to be due to the lack of the info directory, so a mkdir()
was called. This gave incorrect messaging if instead the lockfile
was open by another process:

   "fatal: cannot mkdir .git/objects/info: File exists"

Rearrange the expectations of this directory existing to avoid
this error, and instead show the typical message when a lockfile
already exists.

Sounds like a reasonable bug fix.

Your cover letter is way longer than this description. Should some of
that background perhaps go in the commit message?


I did want a place to include the full die() message in the new 
behavior, but that seemed like overkill for the commit message.



(I would go so far as to say that sending a cover letter for a single
patch is an anti-pattern, because the commit message should be able to
stand on its own).


@@ -754,23 +755,14 @@ void write_commit_graph(const char *obj_dir,
compute_generation_numbers();
  
  	graph_name = get_commit_graph_filename(obj_dir);

-   fd = hold_lock_file_for_update(, graph_name, 0);
  
-	if (fd < 0) {

-   struct strbuf folder = STRBUF_INIT;
-   strbuf_addstr(, graph_name);
-   strbuf_setlen(, strrchr(folder.buf, '/') - folder.buf);
-
-   if (mkdir(folder.buf, 0777) < 0)
-   die_errno(_("cannot mkdir %s"), folder.buf);
-   strbuf_release();
-
-   fd = hold_lock_file_for_update(, graph_name, 
LOCK_DIE_ON_ERROR);
-
-   if (fd < 0)
-   die_errno("unable to create '%s'", graph_name);
-   }
+   strbuf_addstr(, graph_name);
+   strbuf_setlen(, strrchr(folder.buf, '/') - folder.buf);
+   if (!file_exists(folder.buf) && mkdir(folder.buf, 0777) < 0)
+   die_errno(_("cannot mkdir %s"), folder.buf);
+   strbuf_release();

The result is racy if somebody else is trying to create the directory at
the same time. For that you'd want to notice EEXIST coming from mkdir
and consider that a success.

I think you probably ought to be calling adjust_shared_perm() on the
result, too, in case core.sharedrepository is configured.

If you use safe_create_leading_directories(), it should handle both.
Something like:

   if (safe_create_leading_directories(graph_name))
die_errno(_("unable to create leading directories of %s"),
  graph_name));

I think I'm holding it right; that function is a little short on
documentation, but it's the standard way to do this in Git's codebase,
and you can find lots of example callers.


Thanks for this method. I was unfamiliar with it. Saves the effort of 
creating the strbuf, too.


Thanks,
-Stolee


Re: Implementing reftable in Git

2018-05-09 Thread Derrick Stolee

On 5/9/2018 10:33 AM, Christian Couder wrote:

Hi,

I might start working on implementing reftable in Git soon.

During the last Git Merge conference last March Stefan talked about
reftable. In Alex Vandiver's notes [1] it is asked that people
announce it on the list when they start working on it, and it appears
that there is a reference implementation in JGit.


Thanks for starting on this! In addition to the performance gains, this 
will help a lot of users with case-insensitive file systems from getting 
case-errors on refnames.



Looking it up, there is indeed some documentation [2], code [3], tests
[4] and other related stuff [5] in the JGit repo. It looks like the
JGit repo and the reftable code there are licensed under the Eclipse
Distribution License - v 1.0 [7] which is very similar to the 3-Clause
BSD License also called Modified BSD License which is GPL compatible
according to gnu.org [9]. So from a quick look it appears that I
should be able to port the JGit to Git if I just keep the copyright
and license header comments in all the related files.

So I think the most straightforward and compatible way to do it would
be to port the JGit implementation.

Thanks in advance for any suggestion or comment about this.

Reftable was first described by Shawn and then discussed last July on
the list [6].


The hope is that such a direct port should be possible, but someone else 
should comment on the porting process.


This is also something that could be created independently based on the 
documentation you mention. I was planning to attempt that during a 
hackathon in July, but I'm happy you are able to start earlier (and that 
you are announcing your intentions). I would be happy to review your 
patch series, so please keep me posted.


Thanks,
-Stolee


[PATCH 1/1] commit-graph: fix UX issue when .lock file exists

2018-05-09 Thread Derrick Stolee
The commit-graph file lives in the .git/objects/info directory.
Previously, a failure to acquire the commit-graph.lock file was
assumed to be due to the lack of the info directory, so a mkdir()
was called. This gave incorrect messaging if instead the lockfile
was open by another process:

  "fatal: cannot mkdir .git/objects/info: File exists"

Rearrange the expectations of this directory existing to avoid
this error, and instead show the typical message when a lockfile
already exists.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 24 
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index a8c337dd77..8399194da1 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -1,5 +1,6 @@
 #include "cache.h"
 #include "config.h"
+#include "dir.h"
 #include "git-compat-util.h"
 #include "lockfile.h"
 #include "pack.h"
@@ -640,13 +641,13 @@ void write_commit_graph(const char *obj_dir,
struct hashfile *f;
uint32_t i, count_distinct = 0;
char *graph_name;
-   int fd;
struct lock_file lk = LOCK_INIT;
uint32_t chunk_ids[5];
uint64_t chunk_offsets[5];
int num_chunks;
int num_extra_edges;
struct commit_list *parent;
+   struct strbuf folder = STRBUF_INIT;
 
oids.nr = 0;
oids.alloc = approximate_object_count() / 4;
@@ -754,23 +755,14 @@ void write_commit_graph(const char *obj_dir,
compute_generation_numbers();
 
graph_name = get_commit_graph_filename(obj_dir);
-   fd = hold_lock_file_for_update(, graph_name, 0);
 
-   if (fd < 0) {
-   struct strbuf folder = STRBUF_INIT;
-   strbuf_addstr(, graph_name);
-   strbuf_setlen(, strrchr(folder.buf, '/') - folder.buf);
-
-   if (mkdir(folder.buf, 0777) < 0)
-   die_errno(_("cannot mkdir %s"), folder.buf);
-   strbuf_release();
-
-   fd = hold_lock_file_for_update(, graph_name, 
LOCK_DIE_ON_ERROR);
-
-   if (fd < 0)
-   die_errno("unable to create '%s'", graph_name);
-   }
+   strbuf_addstr(, graph_name);
+   strbuf_setlen(, strrchr(folder.buf, '/') - folder.buf);
+   if (!file_exists(folder.buf) && mkdir(folder.buf, 0777) < 0)
+   die_errno(_("cannot mkdir %s"), folder.buf);
+   strbuf_release();
 
+   hold_lock_file_for_update(, graph_name, LOCK_DIE_ON_ERROR);
f = hashfd(lk.tempfile->fd, lk.tempfile->filename.buf);
 
hashwrite_be32(f, GRAPH_SIGNATURE);
-- 
2.17.0.39.g685157f7fb



[PATCH 0/1] Fix UX issue with commit-graph.lock

2018-05-09 Thread Derrick Stolee
We use the lockfile API to avoid multiple Git processes from writing to
the commit-graph file in the .git/objects/info directory. In some cases,
this directory may not exist, so we check for its existence.

The existing code does the following when acquiring the lock:

1. Try to acquire the lock.
2. If it fails, try to create the .git/object/info directory.
3. Try to acquire the lock, failing if necessary.

The problem is that if the lockfile exists, then the mkdir fails, giving
an error that doesn't help the user:

  "fatal: cannot mkdir .git/objects/info: File exists"

While technically this honors the lockfile, it does not help the user.

Instead, do the following:

1. Check for existence of .git/objects/info; create if necessary.
2. Try to acquire the lock, failing if necessary.

The new output looks like:

  fatal: Unable to create 
'/home/stolee/git/.git/objects/info/commit-graph.lock': File exists.

  Another git process seems to be running in this repository, e.g.
  an editor opened by 'git commit'. Please make sure all processes
  are terminated then try again. If it still fails, a git process
  may have crashed in this repository earlier:
  remove the file manually to continue.

This patch is based on ds/generation-numbers

Derrick Stolee (1):
  commit-graph: fix UX issue when .lock file exists

 commit-graph.c | 24 
 1 file changed, 8 insertions(+), 16 deletions(-)


base-commit: 7547b95b4fbb8591726b1d9381c176cc27fc6aea
-- 
2.17.0.39.g685157f7fb



Re: What's cooking in git.git (May 2018, #01; Mon, 7)

2018-05-07 Thread Derrick Stolee

On 5/7/2018 10:58 AM, Junio C Hamano wrote:

* ds/generation-numbers (2018-05-02) 11 commits
  - commit-graph.txt: update design document
  - merge: check config before loading commits
  - commit: use generation number in remove_redundant()
  - commit: add short-circuit to paint_down_to_common()
  - commit: use generation numbers for in_merge_bases()
  - ref-filter: use generation number for --contains
  - commit-graph: always load commit-graph information
  - commit: use generations in paint_down_to_common()
  - commit-graph: compute generation numbers
  - commit: add generation number to struct commmit
  - ref-filter: fix outdated comment on in_commit_list
  (this branch uses ds/commit-graph and ds/lazy-load-trees.)

  A recently added "commit-graph" datafile has learned to store
  pre-computed generation numbers to speed up the decisions to stop
  history traversal.

  Is this ready for 'next'?


I see that you squashed the fix from [1], so I think this is ready to 
go. Thanks!


[1] 
https://public-inbox.org/git/1cfe38f6-925b-d36b-53ae-6b586eed1...@gmail.com/



* ds/lazy-load-trees (2018-05-02) 6 commits
   (merged to 'next' on 2018-05-02 at d54016d9e3)
  + coccinelle: avoid wrong transformation suggestions from commit.cocci
   (merged to 'next' on 2018-04-25 at b90813f421)
  + commit-graph: lazy-load trees for commits
  + treewide: replace maybe_tree with accessor methods
  + commit: create get_commit_tree() method
  + treewide: rename tree to maybe_tree
  + Merge branch 'bw/c-plus-plus' into ds/lazy-load-trees
  (this branch is used by ds/generation-numbers; uses ds/commit-graph.)

  The code has been taught to use the duplicated information stored
  in the commit-graph file to learn the tree object name for a commit
  to avoid opening and parsing the commit object when it makes sense
  to do so.

  Will merge to 'master'.


* ds/commit-graph (2018-04-11) 16 commits
   (merged to 'next' on 2018-04-25 at 18af3d28d9)
  + commit-graph: implement "--append" option
  + commit-graph: build graph from starting commits
  + commit-graph: read only from specific pack-indexes
  + commit: integrate commit graph with commit parsing
  + commit-graph: close under reachability
  + commit-graph: add core.commitGraph setting
  + commit-graph: implement git commit-graph read
  + commit-graph: implement git-commit-graph write
  + commit-graph: implement write_commit_graph()
  + commit-graph: create git-commit-graph builtin
  + graph: add commit graph design document
  + commit-graph: add format document
  + csum-file: refactor finalize_hashfile() method
  + csum-file: rename hashclose() to finalize_hashfile()
  + Merge branch 'jk/cached-commit-buffer' into HEAD
  + Merge branch 'jt/binsearch-with-fanout' into HEAD
  (this branch is used by ds/generation-numbers and ds/lazy-load-trees.)

  Precompute and store information necessary for ancestry traversal
  in a separate file to optimize graph walking.

  Will merge to 'master'.


These have been queued for master for a few weeks. Is anything delaying 
them? I'd love to see the community dogfood this feature by running the 
following on their local repos:


    git config core.commitGraph true
    git show-ref -s | git commit-graph write --stdin-commits

Thanks,
-Stolee


Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

2018-05-07 Thread Derrick Stolee

On 5/4/2018 3:40 PM, Jakub Narebski wrote:

Hello,

With early parts of commit-graph feature (ds/commit-graph and
ds/lazy-load-trees) close to being merged into "master", see
https://public-inbox.org/git/xmqq4ljtz87g@gitster-ct.c.googlers.com/
I think it would be good idea to think what other data could be added
there to make Git even faster.


Before thinking about adding more data to the commit-graph, I think 
instead we need to finish taking advantage of the data that is already 
there. This means landing the generation number patch [1] (I think this 
is close, so I'll send a v6 this week if there is no new feedback.) and 
the auto-compute patch [2] (this could use more feedback, but I'll send 
a v1 based on the RFC feedback if no one chimes in).


[1] 
https://public-inbox.org/git/20180501124652.155781-1-dsto...@microsoft.com/

    [PATCH v5 00/11] Compute and consume generation numbers

[2] 
https://public-inbox.org/git/20180417181028.198397-1-dsto...@microsoft.com/

    [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

The big wins remaining from this data are `git tag --merged` and `git 
log --graph`. The `tag` scenario is probably easier: this can be done by 
replacing the revision-walk underlying the call to use 
paint_down_to_common() instead. Requires adding an external method to 
commit.c, but not too much code.


The tougher challenge is `git log --graph`. The revision walk machinery 
currently uses two precompute phases before iterating results to the 
pager: limit_list() and sort_in_topological_order(); these correspond to 
two phases of Kahn's algorithm for topo-sort (compute in-degrees, then 
walk by peeling commits with in-degree zero). This requires O(N) time, 
where N is the number of reachable commits. Instead, we could make this 
be O(W) time to output one page of results, where W is (roughly) the 
number of reachable commits with generation number above the last 
reported result.


In order to take advantage of this approach, the two phases of Kahn's 
algorithm need to be done in-line with reporting results to the pager. 
This means keeping two queues: one is a priority queue by generation 
number that computes in-degrees, the other is a priority queue (by 
commit-date or a visit-order value to do the --topo-order priority) that 
peels the in-degree-zero commits (and decrements the in-degree of their 
parents). I have not begun this refactoring effort because appears 
complicated to me, and it will be hard to tease out the logic without 
affecting other consumers of the revision-walk machinery.


I would love it if someone picked up the `git log --graph` task, since 
it will be a few weeks before I have the time to focus on it.


Without completing the benefits we get from generation numbers, these 
investigations into other reachability indexes will be incomplete as 
they are comparing benefits without all consumers taking advantage of a 
reachability index.


[...]

Bloom filter for changed paths
--

The goal of this chunk is to speed up checking if the file or directory
was changed in given commit, for queries such as "git log -- " or
"git blame ".  This is something that according to "Git Merge
contributor summit notes" [2] is already present in VSTS (Visual Studio
Team Services - the server counterpart of GVFS: Git Virtual File System)
at Microsoft:

AV> - VSTS adds bloom filters to know which paths have changed on the commit
AV> - tree-same check in the bloom filter is fast; speeds up file history checks
AV> - might be useful in the client as well, since limited-traversal is common
AV> - if the file history is _very_ sparse, then bloom filter is useful
AV> - but needs pre-compute, so useful to do once
AV> - first make the client do it, then think about how to serve it centrally

[2]: 
https://public-inbox.org/git/alpine.DEB.2.20.1803091557510.23109@alexmv-linux/

I think it was what Derrick Stolee was talking about at the end of his
part of "Making Git for Windows" presentation at Git Merge 2018:
https://youtu.be/oOMzi983Qmw?t=1835

This was also mentioned in subthread of "Re: [PATCH v2 0/4] Lazy-load
trees when reading commit-graph", starting from [3]
[3]: https://public-inbox.org/git/86y3hyeu6c@gmail.com/


Again, the benefits of Bloom filters should only be measured after 
already taking advantage of a reachability index during `git log`. 
However, you could get performance benefits from Bloom filters in a 
normal `git log` (no topo-order).


The tricky part about this feature is that the decisions we made in our 
C# implementation for the VSTS Git server may be very different than the 
needs for the C implementation of the Git client. Questions like "how do 
we handle merge commits?" may have different answers, which can only be 
discovered by implementing the feature.


(The answer for VSTS is that we only store Bloom filters containin

Re: [PATCH v3 00/12] get_short_oid UI improvements

2018-05-02 Thread Derrick Stolee

On 5/2/2018 8:42 AM, Derrick Stolee wrote:

On 5/1/2018 2:40 PM, Ævar Arnfjörð Bjarmason wrote:

The biggest change in v3 is the no change at all to the code, but a
lengthy explanation of why I didn't go for Derrick's simpler
implementation. Maybe I'm wrong about that, but I felt uneasy
offloading undocumented (or if I documented it, it would only be for
this one edge-case) magic on the oid_array API. Instead I'm just
making this patch a bit more complex.


I think that's fair. Thanks for going along with me on the thought 
experiment.


Also, v3 looks good to me.

Reviewed-by: Derrick Stolee <dsto...@microsoft.com>



Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()

2018-05-02 Thread Derrick Stolee

On 5/2/2018 9:05 AM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:

     For a copy of the Linux repository, we measured the following
     performance improvements:

     git merge-base v3.3 v4.5

     Before: 234 ms
  After: 208 ms
  Rel %: -11%

     git merge-base v4.3 v4.5

     Before: 102 ms
  After:  83 ms
  Rel %: -19%

     The experiments above were chosen to demonstrate that we are
     improving the filtering of the merge-base set. In the first
     example, more time is spent walking the history to find the
     set of merge bases before the remove_redundant() call. The
     starting commits are closer together in the second example,
     therefore more time is spent in remove_redundant(). The relative
     change in performance differs as expected.

Nice.

I was not expecting as much performance improvements as we got for
--contains tests because remove_redundant() is a final step in longer
process, dominated by man calculations.  Still, nothing to sneeze about.


One reason these numbers are not too surprising is that 
remove_redundant() can demonstrate quadratic behavior. It is calculating 
pair-wise reachability by starting a walk at each of the candidates (in 
the worst case). In typical cases, the first walk marks many of the 
other candidates as redundant and we don't need to start walks from 
those commits.


A possible optimization could be to sort the candidates by descending 
generation so we find the first walk is likely to mark the rest as 
redundant. But this may already be the case if the candidates are added 
to the list in order of "discovery" which is already simulating this 
behavior.


Thanks,
-Stolee


Re: [PATCH v3 00/12] get_short_oid UI improvements

2018-05-02 Thread Derrick Stolee

On 5/1/2018 2:40 PM, Ævar Arnfjörð Bjarmason wrote:

The biggest change in v3 is the no change at all to the code, but a
lengthy explanation of why I didn't go for Derrick's simpler
implementation. Maybe I'm wrong about that, but I felt uneasy
offloading undocumented (or if I documented it, it would only be for
this one edge-case) magic on the oid_array API. Instead I'm just
making this patch a bit more complex.


I think that's fair. Thanks for going along with me on the thought 
experiment.


-Stolee


Re: [PATCH v5 09/11] commit: use generation number in remove_redundant()

2018-05-01 Thread Derrick Stolee



On 5/1/2018 8:47 AM, Derrick Stolee wrote:

The static remove_redundant() method is used to filter a list
of commits by removing those that are reachable from another
commit in the list. This is used to remove all possible merge-
bases except a maximal, mutually independent set.

To determine these commits are independent, we use a number of
paint_down_to_common() walks and use the PARENT1, PARENT2 flags
to determine reachability. Since we only care about reachability
and not the full set of merge-bases between 'one' and 'twos', we
can use the 'min_generation' parameter to short-circuit the walk.

When no commit-graph exists, there is no change in behavior.

For a copy of the Linux repository, we measured the following
performance improvements:

git merge-base v3.3 v4.5

Before: 234 ms
  After: 208 ms
  Rel %: -11%

git merge-base v4.3 v4.5

Before: 102 ms
  After:  83 ms
  Rel %: -19%

The experiments above were chosen to demonstrate that we are
improving the filtering of the merge-base set. In the first
example, more time is spent walking the history to find the
set of merge bases before the remove_redundant() call. The
starting commits are closer together in the second example,
therefore more time is spent in remove_redundant(). The relative
change in performance differs as expected.

Reported-by: Jakub Narebski <jna...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit.c | 7 ++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 9875feec01..5064db4e61 100644
--- a/commit.c
+++ b/commit.c
@@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt)
parse_commit(array[i]);
for (i = 0; i < cnt; i++) {
struct commit_list *common;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;


This initialization should be

    uint32_t min_generation = array[i]->generation;

since the assignment (using j) below skips the ith commit.

  
  		if (redundant[i])

continue;
@@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt)
continue;
filled_index[filled] = j;
work[filled++] = array[j];
+
+   if (array[j]->generation < min_generation)
+   min_generation = array[j]->generation;
}
-   common = paint_down_to_common(array[i], filled, work, 0);
+   common = paint_down_to_common(array[i], filled, work,
+ min_generation);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)




Re: [PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee

On 5/1/2018 10:10 AM, Ævar Arnfjörð Bjarmason wrote:

Actually I'm having second thoughts about that and thinking I might keep
my original approach (with a better explanation).

A few more lines of code seems worthwhile in order to not break the
assumptions a documented API is making, no matter how briefly, so I set
about documenting this case and supporting it, since
e.g. oid_array_lookup() will completely fail with the hack of setting
the .sorted member, and came up with this:

diff --git a/Documentation/technical/api-oid-array.txt 
b/Documentation/technical/api-oid-array.txt
index b0c11f868d..ff87260220 100644
--- a/Documentation/technical/api-oid-array.txt
+++ b/Documentation/technical/api-oid-array.txt
@@ -16,6 +16,20 @@ Data Structures
the actual data. The `nr` member contains the number of items in
the set.  The `alloc` and `sorted` members are used internally,
and should not be needed by API callers.
++
+Both the `oid_array_lookup` and `oid_array_for_each_unique` functions
+rely on the array being sorted. For the former it's an absolute
+requirenment that the internal `oid_array_sort` function has been
+called on it, bu for the latter it's enough that the elements are
+ordered in such a way as to guarantee that identical object IDs are
+adjacent in the array.


s/bu/but/


++
+This is useful e.g. to print output where commits, tags etc. are
+grouped together (barring a hash collision they won't have the same
+object ID), in such cases the `custom_sorted` member can be set to `1`
+before calling `oid_array_for_each_unique`, and it'll skip its own
+sorting. Once it's been set calling e.g. `oid_array_lookup` without it
+being cleared again will cause an internal panic, so use it carefully.

  Functions
  -
diff --git a/sha1-array.c b/sha1-array.c
index 466a926aa3..cbae07ff78 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -18,6 +18,7 @@ static void oid_array_sort(struct oid_array *array)
  {
QSORT(array->oid, array->nr, void_hashcmp);
array->sorted = 1;
+   array->custom_sorted = 0;
  }

  static const unsigned char *sha1_access(size_t index, void *table)
@@ -28,6 +29,13 @@ static const unsigned char *sha1_access(size_t index, void 
*table)

  int oid_array_lookup(struct oid_array *array, const struct object_id *oid)
  {
+   if (array->custom_sorted)
+   /*
+* We could also just clear custom_sorted here, but if
+* the caller is custom sorting and then calling this
+* that's likely something they'd like to know about.
+*/
+   BUG("PANIC: Cannot lookup OIDs in arrays with a custom sort!");


Probably don't need the "PANIC: " here.


if (!array->sorted)
oid_array_sort(array);
return sha1_pos(oid->hash, array->oid, array->nr, sha1_access);
@@ -39,6 +47,7 @@ void oid_array_clear(struct oid_array *array)
array->nr = 0;
array->alloc = 0;
array->sorted = 0;
+   array->custom_sorted = 0;
  }

  int oid_array_for_each_unique(struct oid_array *array,
@@ -47,7 +56,7 @@ int oid_array_for_each_unique(struct oid_array *array,
  {
int i;

-   if (!array->sorted)
+   if (!array->sorted && !array->custom_sorted)
oid_array_sort(array);

for (i = 0; i < array->nr; i++) {
diff --git a/sha1-array.h b/sha1-array.h
index 1e1d24b009..bfa77ba1e4 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -6,6 +6,7 @@ struct oid_array {
int nr;
int alloc;
int sorted;
+   int custom_sorted;
  };

  #define OID_ARRAY_INIT { NULL, 0, 0, 0 }
diff --git a/sha1-name.c b/sha1-name.c
index b81e07adbb..d190800db0 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -490,9 +490,11 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, 
void *cb_data)
find_short_packed_object();

QSORT(collect.oid, collect.nr, sort_ambiguous);
-   collect.sorted = 1;

+   collect.custom_sorted = 1;
ret = oid_array_for_each_unique(, fn, cb_data);
+   collect.custom_sorted = 0;
+
oid_array_clear();
return ret;
  }

So maybe I should just stop worrying and YOLO it, it just seems wrong to
leave such a fragile setup in place where we set .sorted=1 and some
future refactoring reasonably tries to call oid_array_lookup() on it and
silently fails.

What do you think?
I think this extra custom_sort check is worth keeping the API stable to 
future changes.


Thanks,
-Stolee


Re: [PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee



On 5/1/2018 9:39 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, May 01 2018, Derrick Stolee wrote:


From: Ævar Arnfjörð Bjarmason <ava...@gmail.com>

Here is what I mean by sorting during for_each_abbrev(). This seems to work for
me, so I don't know what the issue is with this one-pass approach.
[...]
+static int sort_ambiguous(const void *a, const void *b)
+{
+   int a_type = oid_object_info(a, NULL);
+   int b_type = oid_object_info(b, NULL);
+   int a_type_sort;
+   int b_type_sort;
+
+   /*
+* Sorts by hash within the same object type, just as
+* oid_array_for_each_unique() would do.
+*/
+   if (a_type == b_type)
+   return oidcmp(a, b);
+
+   /*
+* Between object types show tags, then commits, and finally
+* trees and blobs.
+*
+* The object_type enum is commit, tree, blob, tag, but we
+* want tag, commit, tree blob. Cleverly (perhaps too
+* cleverly) do that with modulus, since the enum assigns 1 to
+* commit, so tag becomes 0.
+*/
+   a_type_sort = a_type % 4;
+   b_type_sort = b_type % 4;
+   return a_type_sort > b_type_sort ? 1 : -1;
+}
+
  static int get_short_oid(const char *name, int len, struct object_id *oid,
  unsigned flags)
  {
@@ -451,6 +479,9 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, 
void *cb_data)
find_short_object_filename();
find_short_packed_object();

+   QSORT(collect.oid, collect.nr, sort_ambiguous);
+   collect.sorted = 1;
+

Yes this works. You're right. I wasn't trying to intentionally omit
stuff in my recent 878t93zh60@evledraar.gmail.com, I'd just written
this code some days ago and forgotten why I did what I was doing (and
this is hard to test for), but it's all coming back to me now.

The actual requirement for oid_array_for_each_unique() working properly
is that you've got to feed it in hash order,


To work properly, duplicate entries must be consecutive. Since duplicate 
entries have the same type, our sort satisfies this condition.



but my new sort_ambiguous()
still does that (barring any SHA-1 collisions, at which point we have
bigger problems), so two passes aren't needed. So yes, this apporoach
works and is one-pass.

But that's just an implementation detail of the current sort method,
when I wrote this I was initially playing with other sort orders,
e.g. sorting SHAs regardless of type by the mtime of the file I found
them in. With this approach I'd start printing duplicates if I changed
the internals of sort_ambiguous() like that.


That makes sense.


But I think it's extremely implausible that we'll start sorting things
like that, so I'll just take this method of doing it and add some
comment saying we must hashcmp() the entries in our own sort function
for the de-duplication to work, I don't see us ever changing that.


Sounds good.

Thanks,
-Stolee


Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee

On 5/1/2018 8:36 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, May 01 2018, Derrick Stolee wrote:


How would sorting in our custom order before de-duplicating fail the
de-duplication? We will still pair identical OIDs as consecutive
elements and oid_array_for_each_unique only cares about consecutive
elements having distinct OIDs, not lex-ordered OIDs.

Because there's no de-duplication without the array first being sorted
in oidcmp() order, which oid_array_for_each_unique() checks for and
re-sorts if !array->sorted. I.e. its de-duplication is just a state
machine where it won't call the callback if the currently processed
element has the same SHA1 as the last one.


Perhaps the noise is because we rely on oid_array_sort() to mark the
array as sorted inside oid_array_for_each_unique(), but that could be
remedied by calling our QSORT() inside for_each_abbrev() and marking
the array as sorted before calling oid_array_for_each_unique().

As noted above this won't work, because the function inherently relies
on the array being sorted to be able to de-duplicate. Doing this will
yield duplicate entries.


I'm confused as to why my suggestion doesn't work, so I made it 
concrete. I sent an alternate commit 6/12 to your v2 series [1].


Thanks,
-Stolee

[1] 
https://public-inbox.org/git/20180501130318.58251-1-dsto...@microsoft.com/T/#u


[PATCH v2 06/11] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee
From: Ævar Arnfjörð Bjarmason <ava...@gmail.com>

Here is what I mean by sorting during for_each_abbrev(). This seems to work for
me, so I don't know what the issue is with this one-pass approach.

Thanks,
-Stolee

-- >8 --

Change the output emitted when an ambiguous object is encountered so
that we show tags first, then commits, followed by trees, and finally
blobs. Within each type we show objects in hashcmp() order. Before
this change the objects were only ordered by hashcmp().

The reason for doing this is that the output looks better as a result,
e.g. the v2.17.0 tag before this change on "git show e8f2" would
display:

hint: The candidates are:
hint:   e8f2093055 tree
hint:   e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached 
HEAD abbreviated object name
hint:   e8f21d02f7 blob
hint:   e8f21d577c blob
hint:   e8f25a3a50 tree
hint:   e8f26250fa commit 2017-02-03 - Merge pull request #996 from 
jeffhostetler/jeffhostetler/register_rename_src
hint:   e8f2650052 tag v2.17.0
hint:   e8f2867228 blob
hint:   e8f28d537c tree
hint:   e8f2a35526 blob
hint:   e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for 
multiple remote.url entries
hint:   e8f2cf6ec0 tree

Now we'll instead show:

hint:   e8f2650052 tag v2.17.0
hint:   e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached 
HEAD abbreviated object name
hint:   e8f26250fa commit 2017-02-03 - Merge pull request #996 from 
jeffhostetler/jeffhostetler/register_rename_src
hint:   e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for 
multiple remote.url entries
hint:   e8f2093055 tree
hint:   e8f25a3a50 tree
hint:   e8f28d537c tree
hint:   e8f2cf6ec0 tree
hint:   e8f21d02f7 blob
hint:   e8f21d577c blob
hint:   e8f2867228 blob
hint:   e8f2a35526 blob

Since we show the commit data in the output that's nicely aligned once
we sort by object type. The decision to show tags before commits is
pretty arbitrary. I don't want to order by object_type since there
tags come last after blobs, which doesn't make sense if we want to
show the most important things first.

I could display them after commits, but it's much less likely that
we'll display a tag, so if there is one it makes sense to show it
prominently at the top.

Signed-off-by: Ævar Arnfjörð Bjarmason <ava...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 sha1-name.c | 31 +
 t/t1512-rev-parse-disambiguation.sh | 21 +++
 2 files changed, 52 insertions(+)

diff --git a/sha1-name.c b/sha1-name.c
index 5deebab56d..0336630c64 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -385,6 +385,34 @@ static int collect_ambiguous(const struct object_id *oid, 
void *data)
return 0;
 }
 
+static int sort_ambiguous(const void *a, const void *b)
+{
+   int a_type = oid_object_info(a, NULL);
+   int b_type = oid_object_info(b, NULL);
+   int a_type_sort;
+   int b_type_sort;
+
+   /*
+* Sorts by hash within the same object type, just as
+* oid_array_for_each_unique() would do.
+*/
+   if (a_type == b_type)
+   return oidcmp(a, b);
+
+   /*
+* Between object types show tags, then commits, and finally
+* trees and blobs.
+*
+* The object_type enum is commit, tree, blob, tag, but we
+* want tag, commit, tree blob. Cleverly (perhaps too
+* cleverly) do that with modulus, since the enum assigns 1 to
+* commit, so tag becomes 0.
+*/
+   a_type_sort = a_type % 4;
+   b_type_sort = b_type % 4;
+   return a_type_sort > b_type_sort ? 1 : -1;
+}
+
 static int get_short_oid(const char *name, int len, struct object_id *oid,
  unsigned flags)
 {
@@ -451,6 +479,9 @@ int for_each_abbrev(const char *prefix, each_abbrev_fn fn, 
void *cb_data)
find_short_object_filename();
find_short_packed_object();
 
+   QSORT(collect.oid, collect.nr, sort_ambiguous);
+   collect.sorted = 1;
+
ret = oid_array_for_each_unique(, fn, cb_data);
oid_array_clear();
return ret;
diff --git a/t/t1512-rev-parse-disambiguation.sh 
b/t/t1512-rev-parse-disambiguation.sh
index c7ceda2f21..74e7d9c178 100755
--- a/t/t1512-rev-parse-disambiguation.sh
+++ b/t/t1512-rev-parse-disambiguation.sh
@@ -364,4 +364,25 @@ test_expect_success 'core.disambiguate does not override 
context' '
git -c core.disambiguate=committish rev-parse $sha1^{tree}
 '
 
+test_expect_success C_LOCALE_OUTPUT 'ambiguous commits are printed by type 
first, then hash order' '
+   test_must_fail git rev-parse  2>stderr &&
+   grep ^hint: stderr >hints &&
+   grep  hints >objects &&
+   cat >expected <<-\EOF &&
+   tag
+   

[PATCH v5 07/11] commit: use generation numbers for in_merge_bases()

2018-05-01 Thread Derrick Stolee
The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the initial commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 39a3749abd..3ecdc13356 100644
--- a/commit.c
+++ b/commit.c
@@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
 {
struct commit_list *bases;
int ret = 0, i;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
if (parse_commit(commit))
return ret;
-   for (i = 0; i < nr_reference; i++)
+   for (i = 0; i < nr_reference; i++) {
if (parse_commit(reference[i]))
return ret;
+   if (reference[i]->generation < min_generation)
+   min_generation = reference[i]->generation;
+   }
+
+   if (commit->generation > min_generation)
+   return ret;
 
bases = paint_down_to_common(commit, nr_reference, reference);
if (commit->object.flags & PARENT2)
-- 
2.17.0.39.g685157f7fb



[PATCH v5 01/11] ref-filter: fix outdated comment on in_commit_list

2018-05-01 Thread Derrick Stolee
The in_commit_list() method does not check the parents of
the candidate for containment in the list. Fix the comment
that incorrectly states that it does.

Reported-by: Jakub Narebski <jna...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 ref-filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..aff24d93be 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
 }
 
 /*
- * Test whether the candidate or one of its parents is contained in the list.
+ * Test whether the candidate is contained in the list.
  * Do not recurse to find out, though, but return -1 if inconclusive.
  */
 static enum contains_result contains_test(struct commit *candidate,
-- 
2.17.0.39.g685157f7fb



[PATCH v5 08/11] commit: add short-circuit to paint_down_to_common()

2018-05-01 Thread Derrick Stolee
When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 3ecdc13356..9875feec01 100644
--- a/commit.c
+++ b/commit.c
@@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+   struct commit **twos,
+   int min_generation)
 {
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
+   uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
one->object.flags |= PARENT1;
if (!n) {
@@ -831,6 +834,15 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
struct commit_list *parents;
int flags;
 
+   if (commit->generation > last_gen)
+   BUG("bad generation skip %8x > %8x at %s",
+   commit->generation, last_gen,
+   oid_to_hex(>object.oid));
+   last_gen = commit->generation;
+
+   if (commit->generation < min_generation)
+   break;
+
flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -879,7 +891,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
 
-   list = paint_down_to_common(one, n, twos);
+   list = paint_down_to_common(one, n, twos, 0);
 
while (list) {
struct commit *commit = pop_commit();
@@ -946,7 +958,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1070,7 +1082,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return ret;
 
-   bases = paint_down_to_common(commit, nr_reference, reference);
+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);
if (commit->object.flags & PARENT2)
ret = 1;
clear_commit_marks(commit, all_flags);
-- 
2.17.0.39.g685157f7fb



[PATCH v5 03/11] commit-graph: compute generation numbers

2018-05-01 Thread Derrick Stolee
While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 9ad21c3ffb..36d765e10a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -439,6 +439,8 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
else
packedDate[0] = 0;
 
+   packedDate[0] |= htonl((*list)->generation << 2);
+
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
 
@@ -571,6 +573,45 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
+static void compute_generation_numbers(struct packed_commit_list* commits)
+{
+   int i;
+   struct commit_list *list = NULL;
+
+   for (i = 0; i < commits->nr; i++) {
+   if (commits->list[i]->generation != GENERATION_NUMBER_INFINITY 
&&
+   commits->list[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;
+
+   commit_list_insert(commits->list[i], );
+   while (list) {
+   struct commit *current = list->item;
+   struct commit_list *parent;
+   int all_parents_computed = 1;
+   uint32_t max_generation = 0;
+
+   for (parent = current->parents; parent; parent = 
parent->next) {
+   if (parent->item->generation == 
GENERATION_NUMBER_INFINITY ||
+   parent->item->generation == 
GENERATION_NUMBER_ZERO) {
+   all_parents_computed = 0;
+   commit_list_insert(parent->item, );
+   break;
+   } else if (parent->item->generation > 
max_generation) {
+   max_generation = 
parent->item->generation;
+   }
+   }
+
+   if (all_parents_computed) {
+   current->generation = max_generation + 1;
+   pop_commit();
+
+   if (current->generation > GENERATION_NUMBER_MAX)
+   current->generation = 
GENERATION_NUMBER_MAX;
+   }
+   }
+   }
+}
+
 void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
@@ -694,6 +735,8 @@ void write_commit_graph(const char *obj_dir,
if (commits.nr >= GRAPH_PARENT_MISSING)
die(_("too many commits to write graph"));
 
+   compute_generation_numbers();
+
graph_name = get_commit_graph_filename(obj_dir);
fd = hold_lock_file_for_update(, graph_name, 0);
 
-- 
2.17.0.39.g685157f7fb



[PATCH v5 05/11] commit-graph: always load commit-graph information

2018-05-01 Thread Derrick Stolee
Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 46 +++---
 commit-graph.h |  8 
 commit.c   |  7 +--
 commit.h   |  2 +-
 object.c   |  2 +-
 sha1_file.c|  2 +-
 6 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 36d765e10a..a8c337dd77 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,6 +245,13 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
 }
 
+static void fill_commit_graph_info(struct commit *item, struct commit_graph 
*g, uint32_t pos)
+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->graph_pos = pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
 {
uint32_t edge_value;
@@ -292,31 +299,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
 }
 
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t *pos)
+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(g, &(item->object.oid), pos);
+   }
+}
+
 int parse_commit_in_graph(struct commit *item)
 {
+   uint32_t pos;
+
if (!core_commit_graph)
return 0;
if (item->object.parsed)
return 1;
-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, 
&(item->object.oid), );
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, pos);
-   }
-
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   return fill_commit_in_graph(item, commit_graph, pos);
return 0;
 }
 
+void load_commit_graph_info(struct commit *item)
+{
+   uint32_t pos;
+   if (!core_commit_graph)
+   return;
+   prepare_commit_graph();
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   fill_commit_graph_info(item, commit_graph, pos);
+}
+
 static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit 
*c)
 {
struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+/*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
 struct tree *get_commit_tree_in_graph(const struct commit *c);
 
 struct commit_graph {
diff --git a/commit.c b/commit.c
index 4d00b0a1d6..39a3749abd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, 
unsigned long *sizep)
return ret;
 }
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size)
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size, int check_graph)
 {
const char *tail = buffer;
const char *bufptr = buffer;
@@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void 
*buffer, unsigned long s
}
item->date = parse_commit_date(bufptr, tail);
 
+   if (check_graph)
+   load_commit_graph_info(item);
+
return 0;
 }
 
@@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, 

[PATCH v5 11/11] commit-graph.txt: update design document

2018-05-01 Thread Derrick Stolee
We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 29 
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..e1a883eb46 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having 
"infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+If A and B are commits with generation numbers N amd M, respectively,
+and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.
+
 Design Details
 --
 
@@ -98,18 +121,14 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
   operations are important candidates:
 
-- paint_down_to_common()
 - 'log --topo-order'
+- 'tag --merged'
 
 - Currently, parse_commit_gently() requires filling in the root tree
   object for a commit. This passes through lookup_tree() and consequently
-- 
2.17.0.39.g685157f7fb



[PATCH v5 00/11] Compute and consume generation numbers

2018-05-01 Thread Derrick Stolee
t_list *common;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;

if (redundant[i])
continue;
@@ -955,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt)
continue;
filled_index[filled] = j;
work[filled++] = array[j];
+
+   if (array[j]->generation < min_generation)
+   min_generation = array[j]->generation;
}
-   common = paint_down_to_common(array[i], filled, work, 0);
+   common = paint_down_to_common(array[i], filled, work,
+ min_generation);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1073,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
for (i = 0; i < nr_reference; i++) {
if (parse_commit(reference[i]))
return ret;
-   if (min_generation > reference[i]->generation)
+   if (reference[i]->generation < min_generation)
min_generation = reference[i]->generation;
}


-- >8 --

Derrick Stolee (11):
  ref-filter: fix outdated comment on in_commit_list
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit-graph: always load commit-graph information
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()
  commit: use generation number in remove_redundant()
  merge: check config before loading commits
  commit-graph.txt: update design document

 Documentation/technical/commit-graph.txt | 30 ++--
 alloc.c  |  1 +
 builtin/merge.c  |  7 +-
 commit-graph.c   | 91 
 commit-graph.h   |  8 +++
 commit.c | 61 +---
 commit.h |  7 +-
 object.c |  2 +-
 ref-filter.c | 26 +--
 sha1_file.c  |  2 +-
 t/t5318-commit-graph.sh  |  9 +++
 11 files changed, 204 insertions(+), 40 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
-- 
2.17.0.39.g685157f7fb



[PATCH v5 06/11] ref-filter: use generation number for --contains

2018-05-01 Thread Derrick Stolee
A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
allows significantly short-circuiting commit-graph walks.

Use generation number for '--contains' type queries.

On a copy of the Linux repository where HEAD is contained in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 ref-filter.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index aff24d93be..fb35067fc9 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -16,6 +16,7 @@
 #include "trailer.h"
 #include "wt-status.h"
 #include "commit-slab.h"
+#include "commit-graph.h"
 
 static struct ref_msg {
const char *gone;
@@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
 {
enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit 
*candidate,
 
/* Otherwise, we don't know; prepare to recurse */
parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+
return CONTAINS_UNKNOWN;
 }
 
@@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
 {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   load_commit_graph_info(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }
 
+   result = contains_test(candidate, want, cache, cutoff);
if (result != CONTAINS_UNKNOWN)
return result;
 
@@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
 * If we just popped the stack, parents->item has been marked,
 * therefore contains_test will return a meaningful yes/no.
 */
-   else switch (contains_test(parents->item, want, cache)) {
+   else switch (contains_test(parents->item, want, cache, cutoff)) 
{
case CONTAINS_YES:
*contains_cache_at(cache, commit) = CONTAINS_YES;
contains_stack.nr--;
@@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
}
}
free(contains_stack.contains_stack);
-   return contains_test(candidate, want, cache);
+   return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.39.g685157f7fb



[PATCH v5 09/11] commit: use generation number in remove_redundant()

2018-05-01 Thread Derrick Stolee
The static remove_redundant() method is used to filter a list
of commits by removing those that are reachable from another
commit in the list. This is used to remove all possible merge-
bases except a maximal, mutually independent set.

To determine these commits are independent, we use a number of
paint_down_to_common() walks and use the PARENT1, PARENT2 flags
to determine reachability. Since we only care about reachability
and not the full set of merge-bases between 'one' and 'twos', we
can use the 'min_generation' parameter to short-circuit the walk.

When no commit-graph exists, there is no change in behavior.

For a copy of the Linux repository, we measured the following
performance improvements:

git merge-base v3.3 v4.5

Before: 234 ms
 After: 208 ms
 Rel %: -11%

git merge-base v4.3 v4.5

Before: 102 ms
 After:  83 ms
 Rel %: -19%

The experiments above were chosen to demonstrate that we are
improving the filtering of the merge-base set. In the first
example, more time is spent walking the history to find the
set of merge bases before the remove_redundant() call. The
starting commits are closer together in the second example,
therefore more time is spent in remove_redundant(). The relative
change in performance differs as expected.

Reported-by: Jakub Narebski <jna...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 9875feec01..5064db4e61 100644
--- a/commit.c
+++ b/commit.c
@@ -949,6 +949,7 @@ static int remove_redundant(struct commit **array, int cnt)
parse_commit(array[i]);
for (i = 0; i < cnt; i++) {
struct commit_list *common;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
if (redundant[i])
continue;
@@ -957,8 +958,12 @@ static int remove_redundant(struct commit **array, int cnt)
continue;
filled_index[filled] = j;
work[filled++] = array[j];
+
+   if (array[j]->generation < min_generation)
+   min_generation = array[j]->generation;
}
-   common = paint_down_to_common(array[i], filled, work, 0);
+   common = paint_down_to_common(array[i], filled, work,
+ min_generation);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
-- 
2.17.0.39.g685157f7fb



[PATCH v5 10/11] merge: check config before loading commits

2018-05-01 Thread Derrick Stolee
Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch' and
the commit lookup.

Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.

Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/merge.c | 7 ---
 t/t5318-commit-graph.sh | 9 +
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index 5e5e4497e3..b819756946 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char 
*prefix)
branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL);
if (branch)
skip_prefix(branch, "refs/heads/", );
+
+   init_diff_ui_defaults();
+   git_config(git_merge_config, NULL);
+
if (!branch || is_null_oid(_oid))
head_commit = NULL;
else
head_commit = lookup_commit_or_die(_oid, "HEAD");
 
-   init_diff_ui_defaults();
-   git_config(git_merge_config, NULL);
-
if (branch_mergeoptions)
parse_branch_merge_options(branch_mergeoptions);
argc = parse_options(argc, argv, prefix, builtin_merge_options,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 
merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 
merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git checkout -b merge-5-to-8 commits/5 &&
+   git merge commits/8 &&
+   git show-ref -s merge-5-to-8 >output &&
+   git show-ref -s commits/8 >expect &&
+   test_cmp expect output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb



[PATCH v5 02/11] commit: add generation number to struct commmit

2018-05-01 Thread Derrick Stolee
The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0x
GENERATION_NUMBER_MAX 0x3FFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c| 1 +
 commit-graph.c | 2 ++
 commit.h   | 4 
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+   c->generation = GENERATION_NUMBER_INFINITY;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
 
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
pptr = >parents;
 
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
+#define GENERATION_NUMBER_INFINITY 0x
+#define GENERATION_NUMBER_MAX 0x3FFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 */
struct tree *maybe_tree;
uint32_t graph_pos;
+   uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.39.g685157f7fb



[PATCH v5 04/11] commit: use generations in paint_down_to_common()

2018-05-01 Thread Derrick Stolee
Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 20 +++-
 commit.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..4d00b0a1d6 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, 
const void *b_,
return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused)
+{
+   const struct commit *a = a_, *b = b_;
+
+   /* newer commits first */
+   if (a->generation < b->generation)
+   return 1;
+   else if (a->generation > b->generation)
+   return -1;
+
+   /* use date as a heuristic when generations are equal */
+   if (a->date < b->date)
+   return 1;
+   else if (a->date > b->date)
+   return -1;
+   return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused)
 {
const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
 {
-   struct prio_queue queue = { compare_commits_by_commit_date };
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
 
diff --git a/commit.h b/commit.h
index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const 
char *name, ...);
-- 
2.17.0.39.g685157f7fb



Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee

On 5/1/2018 7:27 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, May 01 2018, Derrick Stolee wrote:


On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:

Since we show the commit data in the output that's nicely aligned once
we sort by object type. The decision to show tags before commits is
pretty arbitrary, but it's much less likely that we'll display a tag,
so if there is one it makes sense to show it first.

Here's a non-arbitrary reason: the object types are ordered
topologically (ignoring self-references):

tag -> commit, tree, blob
commit -> tree
tree -> blob

Thanks. I'll add a patch with that comment to v2.


@@ -421,7 +451,12 @@ static int get_short_oid(const char *name, int len, struct 
object_id *oid,
ds.fn = NULL;
advise(_("The candidates are:"));
-   for_each_abbrev(ds.hex_pfx, show_ambiguous_object, );
+   for_each_abbrev(ds.hex_pfx, collect_ambiguous, );
+   QSORT(collect.oid, collect.nr, sort_ambiguous);

I was wondering how the old code sorted by SHA even when the ambiguous
objects were loaded from different sources (multiple pack-files, loose
objects). Turns out that for_each_abbrev() does its own sort after
collecting the SHAs and then calls the given function pointer only
once per distinct object. This avoids multiple instances of the same
object, which may appear multiple times across pack-files.

I only ask because now we are doing two sorts. I wonder if it would be
more elegant to provide your sorting algorithm to for_each_abbrev()
and let it call show_ambiguous_object as before.

Another question is if we should use this sort generally for all calls
to for_each_abbrev(). The only other case I see is in
builtin/revparse.c.

When preparing v2 I realized how confusing this was, so I'd added this
to the commit message of my WIP re-roll which should explain this:

 A note on the implementation: I started out with something much
 simpler which just replaced oid_array_sort() in sha1-array.c with a
 custom sort function before calling oid_array_for_each_unique(). But
 then dumbly noticed that it doesn't work because the output function
 was tangled up with the code added in fad6b9e590 ("for_each_abbrev:
 drop duplicate objects", 2016-09-26) to ensure we don't display
 duplicate objects.
 
 That's why we're doing two passes here, first we need to sort the list

 and de-duplicate the objects, then sort them in our custom order, and
 finally output them without re-sorting them. I suppose we could also
 make oid_array_for_each_unique() maintain a hashmap of emitted
 objects, but that would increase its memory profile and wouldn't be
 worth the complexity for this one-off use-case,
 oid_array_for_each_unique() is used in many other places.


How would sorting in our custom order before de-duplicating fail the 
de-duplication? We will still pair identical OIDs as consecutive 
elements and oid_array_for_each_unique only cares about consecutive 
elements having distinct OIDs, not lex-ordered OIDs.


Perhaps the noise is because we rely on oid_array_sort() to mark the 
array as sorted inside oid_array_for_each_unique(), but that could be 
remedied by calling our QSORT() inside for_each_abbrev() and marking the 
array as sorted before calling oid_array_for_each_unique().


(Again, my comments are not meant to block this series.)

Thanks,
-Stolee


Re: [PATCH v4 05/10] commit-graph: always load commit-graph information

2018-05-01 Thread Derrick Stolee

On 4/29/2018 6:14 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


Most code paths load commits using lookup_commit() and then
parse_commit().

And this automatically loads commit graph if needed, thanks to changes
in parse_commit_gently(), which parse_commit() uses.


 In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

I guess the problem is that we cannot just add parse_commit_in_graph()
like we did in parse_commit_gently(), for some reason?  Like for example
that parse_commit_gently() uses parse_commit_buffer() - which could have
been handled by moving parse_commit_in_graph() down the call chain from
parse_commit_gently() to parse_commit_buffer()... if not the fact that
check_commit() also uses parse_commit_buffer(), but it does not want to
load commit graph.  Am I right?


If a caller uses parse_commit_buffer() directly, then we will guarantee 
that all values in the struct commit that would be loaded from the 
buffer are loaded from the buffer. This means we do NOT load the root 
tree id or commit date from the commit-graph file. We do still need to 
load the data that is not available in the buffer, such as graph_pos and 
generation.





With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Is it generation number, or generation number and position in commit
graph?


We don't need to ensure the graph_pos (the commit will never be 
re-parsed, so we will not try to find it in the commit-graph file 
again), but we DO need to ensure the generation (or our commit walks 
will be incorrect). We get the graph_pos as a side-effect.





Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter.

I think this commit would be easier to review if it was split into pure
refactoring part (extracting fill_commit_graph_info() and
find_commit_in_graph()).  On the other hand the refactoring was needed
to reduce code duplication betweem existing parse_commit_in_graph() and
new load_commit_graph_info() functions.

I guess that the difference between parse_commit_in_graph() and
load_commit_graph_info() is that the former cares only about having just
enough information that is needed for parse_commit_gently() - and does
not load graph data if commit is parsed, while the latter is about
loading commit-graph data like generation numbers.


Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit-graph.c | 45 ++---
  commit-graph.h |  8 
  commit.c   |  7 +--
  commit.h   |  2 +-
  object.c   |  2 +-
  sha1_file.c|  2 +-
  6 files changed, 46 insertions(+), 20 deletions(-)

I wonder if it would be possible to add tests for this feature, for
example that commit-graph is read when it should (including those branch
lookups), and is not read when the feature should be disabled.

But the only way to test it I can think of is a stupid one: create
invalid commit graph, and check that git fails as expected (trying to
read said malformed file), and does not fail if commit graph feature is
disabled.

Let me reorder files (BTW, is there a way for Git to put *.h files
before *.c files in diff?) for easier review:


diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
   */
  int parse_commit_in_graph(struct commit *item);
  
+/*

+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
  struct tree *get_commit_tree_in_graph(const struct commit *c);
  
  struct commit_graph {

diff --git a/commit-graph.c b/commit-graph.c
index 047fa9fca5..aebd242def 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
  }
  
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)

+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}

The comment in the header file commit-graph.h talks about filling
graph_pos and generation members of the given commit, but I don't see
filling graph_pos member here.


We 

Re: [PATCH v4 03/10] commit-graph: compute generation numbers

2018-05-01 Thread Derrick Stolee

On 4/29/2018 5:08 AM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

Sidenote: for generation numbers it does not matter if we use
depth-first or breadth-first strategy, but it is more natural to use
depth-first search because generation numbers need post-order processing
(parents before child).


The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

A question: what happens if the existing commit graph is from older
version of git and has _ZERO for generation numbers?

Answer: I see that we treat both _INFINITY (not in commit-graph) and
_ZERO (in commit graph but not computed) as not computed generation
numbers.  All right.


If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.

All right, though I guess this would remain theoretical for a long
while.

We don't have any way of testing this, at least not without recompiling
Git with lower value of GENERATION_NUMBER_MAX -- which means not
automatically, isn't it?


Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit-graph.c | 45 +
  1 file changed, 45 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 9ad21c3ffb..047fa9fca5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
else
packedDate[0] = 0;
  
+		if ((*list)->generation != GENERATION_NUMBER_INFINITY)

+   packedDate[0] |= htonl((*list)->generation << 2);
+

If we stumble upon commit marked as "not in commit-graph" while writing
commit graph, it is a BUG(), isn't it?

(Problem noticed by Junio.)


Since we are computing the values for all commits in the list, this 
condition is not important and will be removed.




It is a bit strange to me that the code uses get_be32 for reading, but
htonl for writing.  Is Git tested on non little-endian machines, like
big-endian ppc64 or s390x, or on mixed-endian machines (or
selectable-endian machines with data endianness set to non
little-endian, like ia64)?  If not, could we use for example openSUSE
Build Service (https://build.opensuse.org/) for this?


Since we are packing two values into 64 bits, I am using htonl() here to 
arrange the 30-bit generation number alongside the 34-bit commit date 
value, then writing with hashwrite(). The other 32-bit integers are 
written with hashwrite_be32() to avoid translating this data in-memory.





packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
  
@@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)

}
  }
  
+static void compute_generation_numbers(struct commit** commits,

+  int nr_commits)
+{
+   int i;
+   struct commit_list *list = NULL;

All right, commit_list will work as stack.


+
+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+   commits[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;

All right, we consider _INFINITY and _SERO as not computed.  If
generation number is computed (by 'recursion' or from commit graph), we
(re)use it.  This means that generation number calculation is
incremental, as intended -- good.


+
+   commit_list_insert(commits[i], );

Start depth-first walks from commits given.


+   while (list) {
+   struct commit *current = list->item;
+   struct commit_list *parent;
+   int all_parents_computed = 1;

Here all_parents_computed is a boolean flag.  I see that it is easier to
start with assumption that all parents will have computed generation
numbers.


+   uint32_t max_generation = 0;

The generation number value of 0 functions as sentinel; generation
numbers start from 1.  Not that it matters much, as lowest possible
generation number is 1, and we could have started from that value.


Except that for a commit with no parents, we want it to receive 
generation number max_generation + 1 = 1, so this value of 0 is important.





+
+   for (parent = current->parents; parent; parent = 
parent->next) {
+   if (parent->item->generation == 
GENERATION_NUMBER_INFINITY ||
+   parent->item->generation == 
GENERATION_NUMBER_ZERO) {
+   all_parents_computed =

Re: [PATCH v4 10/10] commit-graph.txt: update design document

2018-05-01 Thread Derrick Stolee

On 4/30/2018 7:32 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>

Looks good.


---
  Documentation/technical/commit-graph.txt | 30 +++-
  1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..d9f2713efa 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having 
"infinite"
  generation number and walk until reaching commits with known generation
  number.
  
+We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not

+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+If A and B are commits with generation numbers N amd M, respectively,
+and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits,

The linux kernel commit graph has maximum of 513 commits sharing the
same generation number, but is is 5.43 commits sharing the same
generation number on average, with standard deviation 10.70; median is
even lower: it is 2, with 5.35 median absolute deviation (MAD).

So on average it would be a few extra commits.  Right.


   but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.

As I wrote before, handling those corner cases in more complicated, but
not that complicated.  We could simply use stronger condition if both
generation numbers are ordinary generation numbers, and weaker condition
when at least one generation number has one of those special values.


+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.

Ordinary generation numbers, where stronger condition holds, are those
between GENERATION_NUMBER_ZERO < gen(C) < GENERATION_NUMBER_MAX.


+
  Design Details
  --
  
@@ -98,17 +121,12 @@ Future Work

  - The 'commit-graph' subcommand does not have a "verify" mode that is
necessary for integration with fsck.
  
-- The file format includes room for precomputed generation numbers. These

-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-

Good.


  - After computing and storing generation numbers, we must make graph
walks aware of generation numbers to gain the performance benefits they
enable. This will mostly be accomplished by swapping a commit-date-ordered
priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
  
-- paint_down_to_common()

  - 'log --topo-order'

Another possible candidates:

- remove_redundant() - see comment in previous patch
- still_interesting() - where Git uses date slop to stop walking
  too far


remove_redundant() will be included in v5, thanks.

Instead of "still_interesting()" I'll add "git tag --merged" as the 
candidate to consider, as discussed in [1].


[1] https://public-inbox.org/git/87fu3g67ry@lant.ki.iif.hu/t/#u
    "branch --contains / tag --merged inconsistency"



  
  - Currently, parse_commit_gently() requires filling in the root tree

One important issue left is handling features that change view of
project history, and their interaction with commit-graph feature.

What would happen, if we turn on commit-graph feature, generate commit
graph file, and then:

   * use graft file or remove graft entries to cut history, or remove cut
 or join two [independent] histories.
   * use git-replace mechanims to do the same
   * in shallow clone, deepen or shorten the clone

What would happen if without re-generating commit-graph file (assuming
tha Git wouldn't do it f

Re: [PATCH v4 09/10] merge: check config before loading commits

2018-05-01 Thread Derrick Stolee

On 4/30/2018 6:54 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch' and
the commit lookup.

Sidenote: I wonder why reading config was postponed to later in the
command lifetime... I guess it was to avoid having to read config if
HEAD was invalid.


The 'branch' does need to be loaded before the call to git_config (as I 
found out after moving the config call too early), so I suppose it was 
natural to pair that with resolving head_commit.





Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.

I guess this is because we avoid re-parsing objects at all costs; we
want to avoid re-reading commit graph too.


Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.

I would prefer if this commit was put earlier in the series, to avoid
having broken Git (and thus a possibility of problems when bisecting) in
between those two commits.


Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  builtin/merge.c | 7 ---
  t/t5318-commit-graph.sh | 9 +
  2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index 5e5e4497e3..b819756946 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char 
*prefix)
branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL);
if (branch)
skip_prefix(branch, "refs/heads/", );
+
+   init_diff_ui_defaults();
+   git_config(git_merge_config, NULL);
+
if (!branch || is_null_oid(_oid))
head_commit = NULL;
else
head_commit = lookup_commit_or_die(_oid, "HEAD");
  
-	init_diff_ui_defaults();

-   git_config(git_merge_config, NULL);
-

Good.


if (branch_mergeoptions)
parse_branch_merge_options(branch_mergeoptions);
argc = parse_options(argc, argv, prefix, builtin_merge_options,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
  graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 
merge/1
  graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 
merge/2
  
+test_expect_success 'perform fast-forward merge in full repo' '

+   cd "$TRASH_DIRECTORY/full" &&
+   git checkout -b merge-5-to-8 commits/5 &&
+   git merge commits/8 &&
+   git show-ref -s merge-5-to-8 >output &&
+   git show-ref -s commits/8 >expect &&
+   test_cmp expect output
+'

All right.  (though I wonder if this tests catches all problems where
BUG("bad generation skip") could have been encountered.


We will never know until we have this series running in the wild (and 
even then, some features are very obscure) and enough people turn on the 
config setting.


One goal of the "fsck and gc" series is to get this feature running 
during the rest of the test suite as much as possible, so we can get 
additional coverage. Also to get more experience from the community 
dogfooding the feature.





+
  test_done

Best,




Re: [PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()

2018-05-01 Thread Derrick Stolee

On 4/30/2018 6:19 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

This is thanks to the fact that generation numbers start at zero (as
special case, though, with _ZERO), and we use strict inequality to avoid
handling _ZERO etc. in a special way.

As you wrote in response in previous version of this series, because
paint_down_to_common() is file-local, there is no need to come up with
symbolic name for GENERATION_NO_CUTOFF case.

All right.


For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

All right, and when using paint_down_to_common() to actually find merge
bases, and not only check for containment, we cannot use cutoff.
Therefore at least one call site needs to run it without functional
change... which we can do.  Good.


For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Nice.


Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit.c | 18 ++
  1 file changed, 14 insertions(+), 4 deletions(-)

Let me reorder chunks a bit to make it easier to review.


diff --git a/commit.c b/commit.c
index 7bb007f56a..e2e16ea1a7 100644
--- a/commit.c
+++ b/commit.c
@@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return ret;
  
-	bases = paint_down_to_common(commit, nr_reference, reference);

+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);
if (commit->object.flags & PARENT2)
ret = 1;
clear_commit_marks(commit, all_flags);
@@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
  }
  
  /* all input commits in one and twos[] must have been parsed! */

-static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+   struct commit **twos,
+   int min_generation)
  {
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
+   uint32_t last_gen = GENERATION_NUMBER_INFINITY;
  
  	one->object.flags |= PARENT1;

if (!n) {
@@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
struct commit_list *parents;
int flags;
  
+		if (commit->generation > last_gen)

+   BUG("bad generation skip");
+   last_gen = commit->generation;

Shouldn't we provide more information about where the problem is to the
user, to make it easier to debug the repository / commit-graph data?

Good to have this sanity check here.


This BUG() _should_ only be seen by developers who add callers which do 
not load commits from the commit-graph file. There is a chance that 
there are cases not covered by this patch and the added tests, though. 
Hopefully we catch them all by dogfooding the feature before turning it 
on by default.


I can add the following to help debug these bad situations:

+   BUG("bad generation skip %d > %d at %s",
+   commit->generation, last_gen,
+   oid_to_hex(>object.oid));






+
+   if (commit->generation < min_generation)
+   break;

So the reasoning for this, as far as I understand, is the following.
Please correct me if I am wrong.

The callsite with non-zero min_generation, in_merge_bases_many(), tries
to find out if "commit" is an ancestor of one of the "references".  At
least one of "references" is above "commit", so in_merge_bases_many()
uses paint_down_to_common() - but is interested only if "commit" was
painted as reachable from one of "references".

Thus we can interrupt the walk if we know that none of [considered]
commits in the queue can reach "commit"/"one" - as if they were all
STALE.

The

Re: [PATCH 0/9] get_short_oid UI improvements

2018-05-01 Thread Derrick Stolee

On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:

I started out just wanting to do 04/09 so I'd get prettier output, but
then noticed that ^{tag}, ^{commit}< ^{blob} and ^{tree} didn't behave
as expected with the disambiguation output, and that core.disambiguate
had never been documented.

Ævar Arnfjörð Bjarmason (9):
   sha1-name.c: remove stray newline
   sha1-array.h: align function arguments
   sha1-name.c: move around the collect_ambiguous() function
   get_short_oid: sort ambiguous objects by type, then SHA-1
   get_short_oid: learn to disambiguate by ^{tag}
   get_short_oid: learn to disambiguate by ^{blob}
   get_short_oid / peel_onion: ^{tree} should mean tree, not treeish
   get_short_oid / peel_onion: ^{tree} should mean commit, not commitish
   config doc: document core.disambiguate

  Documentation/config.txt| 14 ++
  cache.h |  5 ++-
  sha1-array.c| 15 +++
  sha1-array.h|  7 ++-
  sha1-name.c | 69 -
  t/t1512-rev-parse-disambiguation.sh | 32 ++---
  6 files changed, 120 insertions(+), 22 deletions(-)



This is a good series. Please take a look at my suggestion in Patch 4/9, 
but feel free to keep this series as written.


Reviewed-by: Derrick Stolee <dsto...@microsoft.com>


Re: [PATCH 4/9] get_short_oid: sort ambiguous objects by type, then SHA-1

2018-05-01 Thread Derrick Stolee

On 4/30/2018 6:07 PM, Ævar Arnfjörð Bjarmason wrote:

Change the output emitted when an ambiguous object is encountered so
that we show tags first, then commits, followed by trees, and finally
blobs. Within each type we show objects in hashcmp(). Before this
change the objects were only ordered by hashcmp().

The reason for doing this is that the output looks better as a result,
e.g. the v2.17.0 tag before this change on "git show e8f2" would
display:

 hint: The candidates are:
 hint:   e8f2093055 tree
 hint:   e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached 
HEAD abbreviated object name
 hint:   e8f21d02f7 blob
 hint:   e8f21d577c blob
 hint:   e8f25a3a50 tree
 hint:   e8f26250fa commit 2017-02-03 - Merge pull request #996 from 
jeffhostetler/jeffhostetler/register_rename_src
 hint:   e8f2650052 tag v2.17.0
 hint:   e8f2867228 blob
 hint:   e8f28d537c tree
 hint:   e8f2a35526 blob
 hint:   e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for 
multiple remote.url entries
 hint:   e8f2cf6ec0 tree

Now we'll instead show:

 hint:   e8f2650052 tag v2.17.0
 hint:   e8f21caf94 commit 2013-06-24 - bash prompt: print unique detached 
HEAD abbreviated object name
 hint:   e8f26250fa commit 2017-02-03 - Merge pull request #996 from 
jeffhostetler/jeffhostetler/register_rename_src
 hint:   e8f2bc0c06 commit 2015-05-10 - Documentation: note behavior for 
multiple remote.url entries
 hint:   e8f2093055 tree
 hint:   e8f25a3a50 tree
 hint:   e8f28d537c tree
 hint:   e8f2cf6ec0 tree
 hint:   e8f21d02f7 blob
 hint:   e8f21d577c blob
 hint:   e8f2867228 blob
 hint:   e8f2a35526 blob

Since we show the commit data in the output that's nicely aligned once
we sort by object type. The decision to show tags before commits is
pretty arbitrary, but it's much less likely that we'll display a tag,
so if there is one it makes sense to show it first.


Here's a non-arbitrary reason: the object types are ordered 
topologically (ignoring self-references):


tag -> commit, tree, blob
commit -> tree
tree -> blob


Signed-off-by: Ævar Arnfjörð Bjarmason 
---
  sha1-array.c | 15 +++
  sha1-array.h |  3 +++
  sha1-name.c  | 37 -
  3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/sha1-array.c b/sha1-array.c
index 838b3bf847..48bd9e9230 100644
--- a/sha1-array.c
+++ b/sha1-array.c
@@ -41,6 +41,21 @@ void oid_array_clear(struct oid_array *array)
array->sorted = 0;
  }
  
+

+int oid_array_for_each(struct oid_array *array,
+  for_each_oid_fn fn,
+  void *data)
+{
+   int i;
+
+   for (i = 0; i < array->nr; i++) {
+   int ret = fn(array->oid + i, data);
+   if (ret)
+   return ret;
+   }
+   return 0;
+}
+
  int oid_array_for_each_unique(struct oid_array *array,
for_each_oid_fn fn,
void *data)
diff --git a/sha1-array.h b/sha1-array.h
index 1e1d24b009..232bf95017 100644
--- a/sha1-array.h
+++ b/sha1-array.h
@@ -16,6 +16,9 @@ void oid_array_clear(struct oid_array *array);
  
  typedef int (*for_each_oid_fn)(const struct object_id *oid,

   void *data);
+int oid_array_for_each(struct oid_array *array,
+  for_each_oid_fn fn,
+  void *data);
  int oid_array_for_each_unique(struct oid_array *array,
  for_each_oid_fn fn,
  void *data);
diff --git a/sha1-name.c b/sha1-name.c
index 9d7bbd3e96..46d8b1afa6 100644
--- a/sha1-name.c
+++ b/sha1-name.c
@@ -378,6 +378,34 @@ static int collect_ambiguous(const struct object_id *oid, 
void *data)
return 0;
  }
  
+static int sort_ambiguous(const void *a, const void *b)

+{
+   int a_type = oid_object_info(a, NULL);
+   int b_type = oid_object_info(b, NULL);
+   int a_type_sort;
+   int b_type_sort;
+
+   /*
+* Sorts by hash within the same object type, just as
+* oid_array_for_each_unique() would do.
+*/
+   if (a_type == b_type)
+   return oidcmp(a, b);
+
+   /*
+* Between object types show tags, then commits, and finally
+* trees and blobs.
+*
+* The object_type enum is commit, tree, blob, tag, but we
+* want tag, commit, tree blob. Cleverly (perhaps too
+* cleverly) do that with modulus, since the enum assigns 1 to
+* commit, so tag becomes 0.
+*/


I appreciate this comment. Clever things should be marked as such.


+   a_type_sort = a_type % 4;
+   b_type_sort = b_type % 4;
+   return a_type_sort > b_type_sort ? 1 : -1;
+}
+
  static int get_short_oid(const char *name, int len, struct object_id *oid,
  unsigned flags)
  {
@@ -409,6 +437,8 @@ 

Re: branch --contains / tag --merged inconsistency

2018-04-30 Thread Derrick Stolee

On 4/27/2018 12:03 PM, SZEDER Gábor wrote:

Szia Feri,


I'm moving the IRC discussion here, because this might be a bug report
in the end.  So, kindly try these steps (103 MB free space required):

$ git clone https://github.com/ClusterLabs/pacemaker.git && cd pacemaker
[...]
$ git branch --contains Pacemaker-0.6.1
* master
$ git tag --merged master | fgrep Pacemaker-0.6
Pacemaker-0.6.0
Pacemaker-0.6.2
Pacemaker-0.6.3
Pacemaker-0.6.4
Pacemaker-0.6.5
Pacemaker-0.6.6

Notice that Pacemaker-0.6.1 is missing from the output.  Kind people on
IRC didn't find a quick explanation, and we all had to go eventually.
Is this expected behavior?  Reproduced with git 2.11.0 and 2.17.0.

The commit pointed to by the tag Pacemaker-0.6.1 and its parent have a
serious clock skew, i.e. they are a few months older than their parents:

$ git log --format='%h %ad %cd%d%n%s' --date=short 
Pacemaker-0.6.1^..47a8ef4c
47a8ef4ce 2008-02-15 2008-02-15
 Low: TE: Logging - display the op's magic field for unexpected and foreign 
events
b9cfcd6b4 2007-12-10 2007-12-10 (tag: Pacemaker-0.6.2)
 haresources2cib.py: set default-action-timeout to the default (20s)
52e7793e0 2007-12-10 2007-12-10
 haresources2cib.py: update ra parameters lists
dea277271 2008-02-14 2008-02-14
 Medium: Build: Turn on snmp support in rpm packages (patch from MATSUDA, 
Daiki)
f418742fe 2008-02-14 2008-02-14
 Low: Build: Update the .spec file with the one used by build service
ccfa716a5 2008-02-14 2008-02-14
 Medium: SNMP: Allow the snmp subagent to be built (patch from MATSUDA, 
Daiki)
50f0ade2d 2008-02-14 2008-02-14
 Low: Build: Update last release number
90f11667f 2008-02-14 2008-02-14
 Medium: Tools: Make sure the autoconf variables in haresources2cib are 
expanded
9d2383c46 2008-02-11 2008-02-11 (tag: Pacemaker-0.6.1)
 High: cib: Ensure the archived file hits the disk before returning

(branch|tag|describe|...) (--contains|--merged) use the commit timestamp
information as a heuristic to avoid traversing parts of the history,
which makes these operations, especially on big histories, an order of
magnitude or two faster.  Yeah, commit timestamps can't always be
trusted, but skewed commits are rare, and skewed commits with this much
skew are even rarer.

I'm not sure how (or if it's at all possible) to turn off this
timestamp-based optimisation.


This is actually a bit more complicated. The "--merged" check in 'git 
tag' uses a different mechanism to detect which tags are reachable. It 
uses a revision walk starting at the "merge commit" (master in your 
case) and all tags with the "limited" option (to ensure the walk happens 
during prepare_revision_walk()) but marks the merge commit as 
UNINTERESTING. The limit_list() method stops when all commits are marked 
UNINTERESTING - minus some "slop" related to the commits that start the 
walk.


One important note: the set of tags is important here. If you add a new 
tag to the root commit (git tag MyTag a2d71961f) then the walk succeeds 
by ensuring it walks until MyTag. This gets around the clock skew issue. 
There may be other more-recent tags with a clock-skew issue, but since 
Pacemaker-0.6.0 is the oldest tag, that requires the walk to continue 
until at least that date.


The commit-walk machinery in revision.c is rather complicated, and is 
used for a lot of different reasons, such as "git log" and this 
application in "git tag". It is on my list to refactor this code to use 
the commit-graph and generation numbers, but as we can see by this 
example, it is not easy to tease out what is happening in the code.


In a world where generation numbers are expected to be available, we 
could rewrite do_merge_filter() in ref-filter.c to call into 
paint_down_to_common() in commit.c using the new "min_generation" 
marker. By assigning the tags to be in the "twos" list and the merge 
commit in the "one" commit, we can check if the tags have the PARENT1 
flag after the walk in paint_down_to_common(). Due to the static nature 
of paint_down_to_common(), we will likely want to abstract this into an 
external method in commit.c, say can_reach_many(struct commit *from, 
struct commit_list *to).



FWIW, much work is being done on a cached commit graph including commit
generation numbers, which will solve this issue both correctly and more
efficiently.  Perhaps it will already be included in the next release.


The work in ds/generation-numbers is focused on the "git tag --contains" 
method, which does return correctly here (it is the reverse of the 
--merged condition):


Which tags can reach Pacemaker-0.6.1?

$ git tag --contains Pacemaker-0.6.1
(returns a big list)

This is the actual reverse lookup (which branches contain this tag?)

$ git branch --contains Pacemaker-0.6.1 | grep master
* master

These commands work despite clock skew. The commit-graph feature makes 
them faster.


Thanks,
-Stolee



Re: [PATCH v4 02/10] commit: add generation number to struct commmit

2018-04-30 Thread Derrick Stolee

On 4/28/2018 6:35 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
   more than the maximum generation number among the parents of A.

Very minor nitpick: it would be more readable wrapped differently:

   * If a commit A has parents, then the generation number of A is
 one more than the maximum generation number among parents of A.

Very minor nitpick: possibly "parents", not "the parents", but I am
not native English speaker.


Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0x
GENERATION_NUMBER_MAX 0x3FFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.

Good explanation; I wonder if we want to have it in some shortened form
also in comments, and not only in the commit message.


Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  alloc.c| 1 +
  commit-graph.c | 2 ++
  commit.h   | 4 
  3 files changed, 7 insertions(+)

I have reordered patches to make it easier to review.


diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
  #include "pretty.h"
  
  #define COMMIT_NOT_FROM_GRAPH 0x

+#define GENERATION_NUMBER_INFINITY 0x
+#define GENERATION_NUMBER_MAX 0x3FFF
+#define GENERATION_NUMBER_ZERO 0

I wonder if it wouldn't be good to have some short in-line comments
explaining those constants, or a block comment above them.

  
  struct commit_list {

struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 */
struct tree *maybe_tree;
uint32_t graph_pos;
+   uint32_t generation;
  };
  
  extern int save_commit_buffer;

All right, simple addition of the new field.  Nothing to go wrong here.

Sidenote: With 0x7FFF being (if I am not wrong) maximum graph_pos
and maximum number of nodes in commit graph, we won't hit 0x3FFF
generation number limit for all except very, very linear histories.


Both of these limits are far away from being realistic. But we could 
extend the maximum graph_pos independently from the maximum generation 
number now that we have the "capped" logic.





diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+   c->generation = GENERATION_NUMBER_INFINITY;
return c;
  }

All right, start with initializing it with "not from commit-graph" value
after allocation.

  
diff --git a/commit-graph.c b/commit-graph.c

index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
  
+	item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;

+

I guess we should not worry about these "magical constants" sprinkled
here, like "+ 8" above.

Let's examine how it goes, taking a look at commit-graph-format.txt
in Documentation/technical/commit-graph-format.txt

  * The first H (g->hash_len) bytes are for the OID of the root tree.
  * The next 8 bytes are for the positions of the first two parents [...]

So 'commit_data + g->hash_len + 8' is our offset from the start of
commit data.  All right.

   * The next 8 bytes store the generation number of the commit and
 the commit time in seconds since EPOCH.  The generation number
 uses the higher 30 bits of the first 4 bytes. [...]

The higher 30 bits of the 4 bytes, which is 32 bits, means that we need
to shift 32-bit value 2 bits right, so that we get lower 30 bits of
32-bit value.  All right.

   All 4-byte numbers are in network order.

Shouldn't it be ntohl() to convert from network order to host order, and
not get_be32()?  I guess they are the same (network order is big-endian
order), and get_be32() is what rest of git uses...


ntohl() takes a 32-bit value, while get_be32() takes a pointer. This 
makes pulling network-bytes out of streams much cleaner with get_be32(), 
so I try to use that whenever possible.




Re: [PATCH] coccinelle: avoid wrong transformation suggestions from commit.cocci

2018-04-30 Thread Derrick Stolee

On 4/30/2018 5:31 AM, SZEDER Gábor wrote:

The semantic patch 'contrib/coccinelle/commit.cocci' added in
2e27bd7731 (treewide: replace maybe_tree with accessor methods,
2018-04-06) is supposed to "ensure that all references to the
'maybe_tree' member of struct commit are either mutations or accesses
through get_commit_tree()".  So get_commit_tree() clearly must be able
to directly access the 'maybe_tree' member, and 'commit.cocci' has a
bit of a roundabout workaround to ensure that get_commit_tree()'s
direct access in its return statement is not transformed: after all
references to 'maybe_tree' have been transformed to a call to
get_commit_tree(), including the reference in get_commit_tree()
itself, the last rule transforms back a 'return get_commit_tree()'
statement, back then found only in get_commit_tree() itself, to a
direct access.

Unfortunately, already the very next commit shows that this workaround
is insufficient: 7b8a21dba1 (commit-graph: lazy-load trees for
commits, 2018-04-06) extends get_commit_tree() with a condition
directly accessing the 'maybe_tree' member, and Coccinelle with
'commit.cocci' promptly detects it and suggests a transformation to
avoid it.  This transformation is clearly wrong, because calling
get_commit_tree() to access 'maybe_tree' _in_ get_commit_tree() would
obviously lead to recursion.  Furthermore, the same commit added
another, more specialized getter function get_commit_tree_in_graph(),
whose legitimate direct access to 'maybe_tree' triggers a similar
wrong transformation suggestion.


Thanks for catching this, Szeder. Sorry for the noise.


Exclude both of these getter functions from the general rule in
'commit.cocci' that matches their direct accesses to 'maybe_tree'.
Also exclude load_tree_for_commit(), which, as static helper funcion
of get_commit_tree_in_graph(), has legitimate direct access to
'maybe_tree' as well.


This is an interesting feature of Coccinelle. Happy to learn it.


The last rule transforming back 'return get_commit_tree()' statements
to direct accesses thus became unnecessary, remove it.

Signed-off-by: SZEDER Gábor <szeder@gmail.com>


I applied this locally on 'next' and ran the check. I succeeded with no 
changes.


Thanks!

Reviewed-by: Derrick Stolee <dsto...@microsoft.com>



---
  contrib/coccinelle/commit.cocci | 10 --
  1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/contrib/coccinelle/commit.cocci b/contrib/coccinelle/commit.cocci
index ac38525941..a7e9215ffc 100644
--- a/contrib/coccinelle/commit.cocci
+++ b/contrib/coccinelle/commit.cocci
@@ -10,11 +10,15 @@ expression c;
  - c->maybe_tree->object.oid.hash
  + get_commit_tree_oid(c)->hash
  
+// These excluded functions must access c->maybe_tree direcly.

  @@
+identifier f !~ 
"^(get_commit_tree|get_commit_tree_in_graph|load_tree_for_commit)$";
  expression c;
  @@
+  f(...) {...
  - c->maybe_tree
  + get_commit_tree(c)
+  ...}
  
  @@

  expression c;
@@ -22,9 +26,3 @@ expression s;
  @@
  - get_commit_tree(c) = s
  + c->maybe_tree = s
-
-@@
-expression c;
-@@
-- return get_commit_tree(c);
-+ return c->maybe_tree;




Re: [PATCH v4 03/10] commit-graph: compute generation numbers

2018-04-26 Thread Derrick Stolee



On 4/26/2018 8:58 AM, Derrick Stolee wrote:

n 4/25/2018 10:35 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:

@@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct 
hashfile *f, int hash_len,

  else
  packedDate[0] = 0;
  +    if ((*list)->generation != GENERATION_NUMBER_INFINITY)
+    packedDate[0] |= htonl((*list)->generation << 2);
+
  packedDate[1] = htonl((*list)->date);
  hashwrite(f, packedDate, 8);

The ones that have infinity are written as zero here.  The code that
reads the generation field off of a file in fill_commit_graph_info()
and fill_commit_in_graph() both leave such a record in file as-is,
so the reader of what we write out will think it is _ZERO, not _INF.

Not that it matters, as it seems that most of the code being added
by this series treat _ZERO and _INF more or less interchangeably.
But it does raise another question, i.e. do we need both _ZERO and
_INF, or is it sufficient to have just a single _UNKNOWN?


This code is confusing. The 'if' condition is useless, since at this 
point every commit should be finite (since we computed generation 
numbers for everyone). We should just write the value always.


For the sake of discussion, the value _INFINITY means not in the graph 
and _ZERO means in the graph without a computed generation number. 
It's a small distinction, but it gives a single boundary to use for 
reachability queries that test generation number.




@@ -571,6 +574,46 @@ static void close_reachable(struct 
packed_oid_list *oids)

  }
  }
  +static void compute_generation_numbers(struct commit** commits,
+   int nr_commits)
+{
+    int i;
+    struct commit_list *list = NULL;
+
+    for (i = 0; i < nr_commits; i++) {
+    if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+    commits[i]->generation != GENERATION_NUMBER_ZERO)
+    continue;
+
+    commit_list_insert(commits[i], );
+    while (list) {
+    struct commit *current = list->item;
+    struct commit_list *parent;
+    int all_parents_computed = 1;
+    uint32_t max_generation = 0;
+
+    for (parent = current->parents; parent; parent = 
parent->next) {
+    if (parent->item->generation == 
GENERATION_NUMBER_INFINITY ||
+    parent->item->generation == 
GENERATION_NUMBER_ZERO) {

+    all_parents_computed = 0;
+    commit_list_insert(parent->item, );
+    break;
+    } else if (parent->item->generation > 
max_generation) {

+    max_generation = parent->item->generation;
+    }
+    }
+
+    if (all_parents_computed) {
+    current->generation = max_generation + 1;
+    pop_commit();
+    }

If we haven't computed all parents' generations yet,
current->generation is undefined (or at least "left as
initialized"), so it does not make much sense to attempt to clip it
at _MAX at this point.  At leat not yet.

IOW, shouldn't the following two lines be inside the "we now know
genno of all parents, so we can compute genno for commit" block
above?


You're right! Good catch. This code sets every merge commit to _MAX. 
It should be in the block above.





+    if (current->generation > GENERATION_NUMBER_MAX)
+    current->generation = GENERATION_NUMBER_MAX;
+    }
+    }


This bothered me: why didn't I catch a bug here? I rebased my "fsck" RFC 
onto this branch and it succeeded. Then, I realized that this does not 
actually write incorrect values, since we re-visit this commit again 
after we pop the stack down to this commit. However, there is time in 
the middle where we have set the generation (in memory) incorrectly and 
that could easily turn into a real bug by a later change.


I'll stick the _MAX check in the if above to prevent confusion.

Thanks,
-Stolee



Re: [PATCH v4 03/10] commit-graph: compute generation numbers

2018-04-26 Thread Derrick Stolee

n 4/25/2018 10:35 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


@@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
else
packedDate[0] = 0;
  
+		if ((*list)->generation != GENERATION_NUMBER_INFINITY)

+   packedDate[0] |= htonl((*list)->generation << 2);
+
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);

The ones that have infinity are written as zero here.  The code that
reads the generation field off of a file in fill_commit_graph_info()
and fill_commit_in_graph() both leave such a record in file as-is,
so the reader of what we write out will think it is _ZERO, not _INF.

Not that it matters, as it seems that most of the code being added
by this series treat _ZERO and _INF more or less interchangeably.
But it does raise another question, i.e. do we need both _ZERO and
_INF, or is it sufficient to have just a single _UNKNOWN?


This code is confusing. The 'if' condition is useless, since at this 
point every commit should be finite (since we computed generation 
numbers for everyone). We should just write the value always.


For the sake of discussion, the value _INFINITY means not in the graph 
and _ZERO means in the graph without a computed generation number. It's 
a small distinction, but it gives a single boundary to use for 
reachability queries that test generation number.





@@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
}
  }
  
+static void compute_generation_numbers(struct commit** commits,

+  int nr_commits)
+{
+   int i;
+   struct commit_list *list = NULL;
+
+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+   commits[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;
+
+   commit_list_insert(commits[i], );
+   while (list) {
+   struct commit *current = list->item;
+   struct commit_list *parent;
+   int all_parents_computed = 1;
+   uint32_t max_generation = 0;
+
+   for (parent = current->parents; parent; parent = 
parent->next) {
+   if (parent->item->generation == 
GENERATION_NUMBER_INFINITY ||
+   parent->item->generation == 
GENERATION_NUMBER_ZERO) {
+   all_parents_computed = 0;
+   commit_list_insert(parent->item, );
+   break;
+   } else if (parent->item->generation > 
max_generation) {
+   max_generation = 
parent->item->generation;
+   }
+   }
+
+   if (all_parents_computed) {
+   current->generation = max_generation + 1;
+   pop_commit();
+   }

If we haven't computed all parents' generations yet,
current->generation is undefined (or at least "left as
initialized"), so it does not make much sense to attempt to clip it
at _MAX at this point.  At leat not yet.

IOW, shouldn't the following two lines be inside the "we now know
genno of all parents, so we can compute genno for commit" block
above?


You're right! Good catch. This code sets every merge commit to _MAX. It 
should be in the block above.





+   if (current->generation > GENERATION_NUMBER_MAX)
+   current->generation = GENERATION_NUMBER_MAX;
+   }
+   }
+}
+
  void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
@@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir,
if (commits.nr >= GRAPH_PARENT_MISSING)
die(_("too many commits to write graph"));
  
+	compute_generation_numbers(commits.list, commits.nr);

+
graph_name = get_commit_graph_filename(obj_dir);
fd = hold_lock_file_for_update(, graph_name, 0);


Re: What's cooking in git.git (Apr 2018, #03; Wed, 25)

2018-04-26 Thread Derrick Stolee

On 4/25/2018 1:43 PM, Brandon Williams wrote:

On 04/25, Ævar Arnfjörð Bjarmason wrote:

* bw/protocol-v2 (2018-03-15) 35 commits
   (merged to 'next' on 2018-04-11 at 23ee234a2c)
  + remote-curl: don't request v2 when pushing
  + remote-curl: implement stateless-connect command
  + http: eliminate "# service" line when using protocol v2
  + http: don't always add Git-Protocol header
  + http: allow providing extra headers for http requests
  + remote-curl: store the protocol version the server responded with
  + remote-curl: create copy of the service name
  + pkt-line: add packet_buf_write_len function
  + transport-helper: introduce stateless-connect
  + transport-helper: refactor process_connect_service
  + transport-helper: remove name parameter
  + connect: don't request v2 when pushing
  + connect: refactor git_connect to only get the protocol version once
  + fetch-pack: support shallow requests
  + fetch-pack: perform a fetch using v2
  + upload-pack: introduce fetch server command
  + push: pass ref prefixes when pushing
  + fetch: pass ref prefixes when fetching
  + ls-remote: pass ref prefixes when requesting a remote's refs
  + transport: convert transport_get_remote_refs to take a list of ref prefixes
  + transport: convert get_refs_list to take a list of ref prefixes
  + connect: request remote refs using v2
  + ls-refs: introduce ls-refs server command
  + serve: introduce git-serve
  + test-pkt-line: introduce a packet-line test helper
  + protocol: introduce enum protocol_version value protocol_v2
  + transport: store protocol version
  + connect: discover protocol version outside of get_remote_heads
  + connect: convert get_remote_heads to use struct packet_reader
  + transport: use get_refs_via_connect to get refs
  + upload-pack: factor out processing lines
  + upload-pack: convert to a builtin
  + pkt-line: add delim packet support
  + pkt-line: allow peeking a packet line without consuming it
  + pkt-line: introduce packet_read_with_status
  (this branch is used by bw/server-options.)

  The beginning of the next-gen transfer protocol.

  Will cook in 'next'.

With a month & 10 days of no updates & this looking stable it would be
great to have it in master sooner than later to build on top of it in
the 2.18 window.

I personally think that this series is ready to graduate to master but I
mentioned to Junio off-list that it might be a good idea to wait until
it has undergone a little more stress testing.  We've been in the
process of trying to get this rolled out to our internal server but due
to a few roadblocks and people being out of the office its taken a bit
longer than I would have liked to get it up and running.  I'm hoping in
another few days/a week I'll have some more data on live traffic.  At
that point I think I'll be more convinced that it will be safe to merge it.

I may be overly cautions but I'm hoping that we can get this right
without needing to do another protocol version bump for a very long
time.  Technically using v2 is hidden behind an "experimental" config
flag so we should still be able to tweak it after the fact if we
absolutely needed to, but I'd like to avoid that if necessary.


There's no testing better than production. I think if we have an 
opportunity to test with realistic traffic, we should take advantage.


But I also agree that this series looks stable.

I realize it's hard to communicate both "this series is ready to merge" 
and "I appreciate your caution."


Thanks,

-Stolee



Re: [PATCH v4 00/10] Compute and consume generation numbers

2018-04-25 Thread Derrick Stolee

As promised, here is the diff from v3.

Thanks,
-Stolee

-- >8 --

diff --git a/builtin/merge.c b/builtin/merge.c
index 7e1da6c6ea..b819756946 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,6 +1148,7 @@ int cmd_merge(int argc, const char **argv, const 
char *prefix)
    branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, 
NULL);

    if (branch)
    skip_prefix(branch, "refs/heads/", );
+
    init_diff_ui_defaults();
    git_config(git_merge_config, NULL);

@@ -1156,7 +1157,6 @@ int cmd_merge(int argc, const char **argv, const 
char *prefix)

    else
    head_commit = lookup_commit_or_die(_oid, "HEAD");

-
    if (branch_mergeoptions)
    parse_branch_merge_options(branch_mergeoptions);
    argc = parse_options(argc, argv, prefix, builtin_merge_options,
diff --git a/commit-graph.c b/commit-graph.c
index 21e853c21a..aebd242def 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -257,7 +257,7 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin

    uint32_t *parent_data_ptr;
    uint64_t date_low, date_high;
    struct commit_list **pptr;
-   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   const unsigned char *commit_data = g->chunk_commit_data + 
(g->hash_len + 16) * pos;


    item->object.parsed = 1;
    item->graph_pos = pos;
@@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin

    *pos = item->graph_pos;
    return 1;
    } else {
-   return bsearch_graph(commit_graph, &(item->object.oid), 
pos);

+   return bsearch_graph(g, &(item->object.oid), pos);
    }
 }

@@ -312,10 +312,10 @@ int parse_commit_in_graph(struct commit *item)
 {
    uint32_t pos;

-   if (item->object.parsed)
-   return 0;
    if (!core_commit_graph)
    return 0;
+   if (item->object.parsed)
+   return 1;
    prepare_commit_graph();
    if (commit_graph && find_commit_in_graph(item, commit_graph, ))
    return fill_commit_in_graph(item, commit_graph, pos);
@@ -454,9 +454,8 @@ static void write_graph_chunk_data(struct hashfile 
*f, int hash_len,

    else
    packedDate[0] = 0;

-   if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+   if ((*list)->generation != GENERATION_NUMBER_INFINITY)
    packedDate[0] |= htonl((*list)->generation << 2);
-   }

    packedDate[1] = htonl((*list)->date);
    hashwrite(f, packedDate, 8);
diff --git a/commit.c b/commit.c
index 9ef6f699bd..e2e16ea1a7 100644
--- a/commit.c
+++ b/commit.c
@@ -653,7 +653,7 @@ int compare_commits_by_gen_then_commit_date(const 
void *a_, const void *b_, void

    else if (a->generation > b->generation)
    return -1;

-   /* use date as a heuristic when generataions are equal */
+   /* use date as a heuristic when generations are equal */
    if (a->date < b->date)
    return 1;
    else if (a->date > b->date)
@@ -1078,7 +1078,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *

    }

    if (commit->generation > min_generation)
-   return 0;
+   return ret;

    bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);

    if (commit->object.flags & PARENT2)
diff --git a/ref-filter.c b/ref-filter.c
index e2fea6d635..fb35067fc9 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -16,6 +16,7 @@
 #include "trailer.h"
 #include "wt-status.h"
 #include "commit-slab.h"
+#include "commit-graph.h"

 static struct ref_msg {
    const char *gone;
@@ -1582,7 +1583,7 @@ static int in_commit_list(const struct commit_list 
*want, struct commit *c)

 }

 /*
- * Test whether the candidate or one of its parents is contained in the 
list.

+ * Test whether the candidate is contained in the list.
  * Do not recurse to find out, though, but return -1 if inconclusive.
  */
 static enum contains_result contains_test(struct commit *candidate,
@@ -1629,7 +1630,7 @@ static enum contains_result 
contains_tag_algo(struct commit *candidate,


    for (p = want; p; p = p->next) {
    struct commit *c = p->item;
-   parse_commit_or_die(c);
+   load_commit_graph_info(c);
    if (c->generation < cutoff)
    cutoff = c->generation;
    }




[PATCH v4 09/10] merge: check config before loading commits

2018-04-25 Thread Derrick Stolee
Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch' and
the commit lookup.

Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.

Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/merge.c | 7 ---
 t/t5318-commit-graph.sh | 9 +
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index 5e5e4497e3..b819756946 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,14 +1148,15 @@ int cmd_merge(int argc, const char **argv, const char 
*prefix)
branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL);
if (branch)
skip_prefix(branch, "refs/heads/", );
+
+   init_diff_ui_defaults();
+   git_config(git_merge_config, NULL);
+
if (!branch || is_null_oid(_oid))
head_commit = NULL;
else
head_commit = lookup_commit_or_die(_oid, "HEAD");
 
-   init_diff_ui_defaults();
-   git_config(git_merge_config, NULL);
-
if (branch_mergeoptions)
parse_branch_merge_options(branch_mergeoptions);
argc = parse_options(argc, argv, prefix, builtin_merge_options,
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 
merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 
merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git checkout -b merge-5-to-8 commits/5 &&
+   git merge commits/8 &&
+   git show-ref -s merge-5-to-8 >output &&
+   git show-ref -s commits/8 >expect &&
+   test_cmp expect output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb



[PATCH v4 10/10] commit-graph.txt: update design document

2018-04-25 Thread Derrick Stolee
We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 30 +++-
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..d9f2713efa 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having 
"infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+If A and B are commits with generation numbers N amd M, respectively,
+and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.
+
 Design Details
 --
 
@@ -98,17 +121,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-- paint_down_to_common()
 - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0.39.g685157f7fb



[PATCH v4 04/10] commit: use generations in paint_down_to_common()

2018-04-25 Thread Derrick Stolee
Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 20 +++-
 commit.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..4d00b0a1d6 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, 
const void *b_,
return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused)
+{
+   const struct commit *a = a_, *b = b_;
+
+   /* newer commits first */
+   if (a->generation < b->generation)
+   return 1;
+   else if (a->generation > b->generation)
+   return -1;
+
+   /* use date as a heuristic when generations are equal */
+   if (a->date < b->date)
+   return 1;
+   else if (a->date > b->date)
+   return -1;
+   return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused)
 {
const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
 {
-   struct prio_queue queue = { compare_commits_by_commit_date };
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
 
diff --git a/commit.h b/commit.h
index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const 
char *name, ...);
-- 
2.17.0.39.g685157f7fb



[PATCH v4 07/10] commit: use generation numbers for in_merge_bases()

2018-04-25 Thread Derrick Stolee
The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the initial commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 39a3749abd..7bb007f56a 100644
--- a/commit.c
+++ b/commit.c
@@ -1056,12 +1056,19 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
 {
struct commit_list *bases;
int ret = 0, i;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
if (parse_commit(commit))
return ret;
-   for (i = 0; i < nr_reference; i++)
+   for (i = 0; i < nr_reference; i++) {
if (parse_commit(reference[i]))
return ret;
+   if (min_generation > reference[i]->generation)
+   min_generation = reference[i]->generation;
+   }
+
+   if (commit->generation > min_generation)
+   return ret;
 
bases = paint_down_to_common(commit, nr_reference, reference);
if (commit->object.flags & PARENT2)
-- 
2.17.0.39.g685157f7fb



[PATCH v4 05/10] commit-graph: always load commit-graph information

2018-04-25 Thread Derrick Stolee
Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 45 ++---
 commit-graph.h |  8 
 commit.c   |  7 +--
 commit.h   |  2 +-
 object.c   |  2 +-
 sha1_file.c|  2 +-
 6 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 047fa9fca5..aebd242def 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,6 +245,12 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
 }
 
+static void fill_commit_graph_info(struct commit *item, struct commit_graph 
*g, uint32_t pos)
+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
 {
uint32_t edge_value;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
 }
 
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t *pos)
+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(g, &(item->object.oid), pos);
+   }
+}
+
 int parse_commit_in_graph(struct commit *item)
 {
+   uint32_t pos;
+
if (!core_commit_graph)
return 0;
if (item->object.parsed)
return 1;
-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, 
&(item->object.oid), );
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, pos);
-   }
-
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   return fill_commit_in_graph(item, commit_graph, pos);
return 0;
 }
 
+void load_commit_graph_info(struct commit *item)
+{
+   uint32_t pos;
+   if (!core_commit_graph)
+   return;
+   prepare_commit_graph();
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   fill_commit_graph_info(item, commit_graph, pos);
+}
+
 static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit 
*c)
 {
struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+/*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
 struct tree *get_commit_tree_in_graph(const struct commit *c);
 
 struct commit_graph {
diff --git a/commit.c b/commit.c
index 4d00b0a1d6..39a3749abd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, 
unsigned long *sizep)
return ret;
 }
 
-int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size)
+int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size, int check_graph)
 {
const char *tail = buffer;
const char *bufptr = buffer;
@@ -386,6 +386,9 @@ int parse_commit_buffer(struct commit *item, const void 
*buffer, unsigned long s
}
item->date = parse_commit_date(bufptr, tail);
 
+   if (check_graph)
+   load_commit_graph_info(item);
+
return 0;
 }
 
@@ -412,7 +415,7 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return error("

[PATCH v4 08/10] commit: add short-circuit to paint_down_to_common()

2018-04-25 Thread Derrick Stolee
When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 7bb007f56a..e2e16ea1a7 100644
--- a/commit.c
+++ b/commit.c
@@ -808,11 +808,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+   struct commit **twos,
+   int min_generation)
 {
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
+   uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
one->object.flags |= PARENT1;
if (!n) {
@@ -831,6 +834,13 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
struct commit_list *parents;
int flags;
 
+   if (commit->generation > last_gen)
+   BUG("bad generation skip");
+   last_gen = commit->generation;
+
+   if (commit->generation < min_generation)
+   break;
+
flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -879,7 +889,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
 
-   list = paint_down_to_common(one, n, twos);
+   list = paint_down_to_common(one, n, twos, 0);
 
while (list) {
struct commit *commit = pop_commit();
@@ -946,7 +956,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1070,7 +1080,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return ret;
 
-   bases = paint_down_to_common(commit, nr_reference, reference);
+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);
if (commit->object.flags & PARENT2)
ret = 1;
clear_commit_marks(commit, all_flags);
-- 
2.17.0.39.g685157f7fb



[PATCH v4 06/10] ref-filter: use generation number for --contains

2018-04-25 Thread Derrick Stolee
A commit A can reach a commit B only if the generation number of A
is strictly larger than the generation number of B. This condition
allows significantly short-circuiting commit-graph walks.

Use generation number for '--contains' type queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 ref-filter.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index aff24d93be..fb35067fc9 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -16,6 +16,7 @@
 #include "trailer.h"
 #include "wt-status.h"
 #include "commit-slab.h"
+#include "commit-graph.h"
 
 static struct ref_msg {
const char *gone;
@@ -1587,7 +1588,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
 {
enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1603,6 +1605,10 @@ static enum contains_result contains_test(struct commit 
*candidate,
 
/* Otherwise, we don't know; prepare to recurse */
parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+
return CONTAINS_UNKNOWN;
 }
 
@@ -1618,8 +1624,18 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
 {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   load_commit_graph_info(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }
 
+   result = contains_test(candidate, want, cache, cutoff);
if (result != CONTAINS_UNKNOWN)
return result;
 
@@ -1637,7 +1653,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
 * If we just popped the stack, parents->item has been marked,
 * therefore contains_test will return a meaningful yes/no.
 */
-   else switch (contains_test(parents->item, want, cache)) {
+   else switch (contains_test(parents->item, want, cache, cutoff)) 
{
case CONTAINS_YES:
*contains_cache_at(cache, commit) = CONTAINS_YES;
contains_stack.nr--;
@@ -1651,7 +1667,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
}
}
free(contains_stack.contains_stack);
-   return contains_test(candidate, want, cache);
+   return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.39.g685157f7fb



[PATCH v4 02/10] commit: add generation number to struct commmit

2018-04-25 Thread Derrick Stolee
The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0x
GENERATION_NUMBER_MAX 0x3FFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c| 1 +
 commit-graph.c | 2 ++
 commit.h   | 4 
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+   c->generation = GENERATION_NUMBER_INFINITY;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
 
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
pptr = >parents;
 
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
+#define GENERATION_NUMBER_INFINITY 0x
+#define GENERATION_NUMBER_MAX 0x3FFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 */
struct tree *maybe_tree;
uint32_t graph_pos;
+   uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.39.g685157f7fb



[PATCH v4 00/10] Compute and consume generation numbers

2018-04-25 Thread Derrick Stolee
Thanks for the feedback on the previous version. I think this series is
stabilizing nicely. I'll reply to this message with an inter-diff as it
is not too large to share but would clutter this cover letter.

Thanks,
-Stolee

-- >8 --

This is the one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph) and lazy-loading trees
(ds/lazy-load-trees).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a short-circuit mechanism
to improve performance of `git branch --contains`.

Further, use generation numbers for 'git tag --contains), providing a
significant speedup (at least 95% for some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series is built on ds/lazy-load-trees.

Derrick Stolee (10):
  ref-filter: fix outdated comment on in_commit_list
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit-graph: always load commit-graph information
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()
  merge: check config before loading commits
  commit-graph.txt: update design document

 Documentation/technical/commit-graph.txt | 30 ++--
 alloc.c  |  1 +
 builtin/merge.c  |  7 +-
 commit-graph.c   | 92 
 commit-graph.h   |  8 +++
 commit.c | 54 +++---
 commit.h |  7 +-
 object.c |  2 +-
 ref-filter.c | 26 +--
 sha1_file.c  |  2 +-
 t/t5318-commit-graph.sh  |  9 +++
 11 files changed, 198 insertions(+), 40 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
-- 
2.17.0.39.g685157f7fb



[PATCH v4 03/10] commit-graph: compute generation numbers

2018-04-25 Thread Derrick Stolee
While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 45 +
 1 file changed, 45 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 9ad21c3ffb..047fa9fca5 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -439,6 +439,9 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
else
packedDate[0] = 0;
 
+   if ((*list)->generation != GENERATION_NUMBER_INFINITY)
+   packedDate[0] |= htonl((*list)->generation << 2);
+
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
 
@@ -571,6 +574,46 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+  int nr_commits)
+{
+   int i;
+   struct commit_list *list = NULL;
+
+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+   commits[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;
+
+   commit_list_insert(commits[i], );
+   while (list) {
+   struct commit *current = list->item;
+   struct commit_list *parent;
+   int all_parents_computed = 1;
+   uint32_t max_generation = 0;
+
+   for (parent = current->parents; parent; parent = 
parent->next) {
+   if (parent->item->generation == 
GENERATION_NUMBER_INFINITY ||
+   parent->item->generation == 
GENERATION_NUMBER_ZERO) {
+   all_parents_computed = 0;
+   commit_list_insert(parent->item, );
+   break;
+   } else if (parent->item->generation > 
max_generation) {
+   max_generation = 
parent->item->generation;
+   }
+   }
+
+   if (all_parents_computed) {
+   current->generation = max_generation + 1;
+   pop_commit();
+   }
+
+   if (current->generation > GENERATION_NUMBER_MAX)
+   current->generation = GENERATION_NUMBER_MAX;
+   }
+   }
+}
+
 void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
@@ -694,6 +737,8 @@ void write_commit_graph(const char *obj_dir,
if (commits.nr >= GRAPH_PARENT_MISSING)
die(_("too many commits to write graph"));
 
+   compute_generation_numbers(commits.list, commits.nr);
+
graph_name = get_commit_graph_filename(obj_dir);
fd = hold_lock_file_for_update(, graph_name, 0);
 
-- 
2.17.0.39.g685157f7fb



[PATCH v4 01/10] ref-filter: fix outdated comment on in_commit_list

2018-04-25 Thread Derrick Stolee
The in_commit_list() method does not check the parents of
the candidate for containment in the list. Fix the comment
that incorrectly states that it does.

Reported-by: Jakub Narebski <jna...@gmail.com>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 ref-filter.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..aff24d93be 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1582,7 +1582,7 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
 }
 
 /*
- * Test whether the candidate or one of its parents is contained in the list.
+ * Test whether the candidate is contained in the list.
  * Do not recurse to find out, though, but return -1 if inconclusive.
  */
 static enum contains_result contains_test(struct commit *candidate,
-- 
2.17.0.39.g685157f7fb



Re: [PATCH v3 5/9] ref-filter: use generation number for --contains

2018-04-25 Thread Derrick Stolee

On 4/24/2018 2:56 PM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:

One way to fix this is to call 'prepare_commit_graph()' directly and
then test that 'commit_graph' is non-null before performing any
parses. I'm not thrilled with how that couples the commit-graph
implementation to this feature, but that may be necessary to avoid
regressions in the non-commit-graph case.

Another possible solution (not sure if better or worse) would be to
change the signature of contains_tag_algo() function to take parameter
or flag that would decide whether to parse "wants".


If I reorder commits so "commit-graph:always load commit-graph 
information" is before this one, then we can call 
load_commit_graph_info() which just fills the generation and graph_pos 
information. This will keep the coupling very light, instead of needing 
to call prepare_commit_graph() or checking the config setting.


Thanks,
-Stolee


Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()

2018-04-24 Thread Derrick Stolee

On 4/23/2018 5:38 PM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:


On 4/18/2018 7:19 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


[...]

[...], and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.


If I understand the code properly, what happens is that we can now
short-circuit if all commits that are left are lower than the target
commit.

This is because max-order priority queue is used: if the commit with
maximum generation number is below generation number of target commit,
then target commit is not reachable from any commit in the priority
queue (all of which has generation number less or equal than the commit
at head of queue, i.e. all are same level or deeper); compare what I
have written in [1]

[1]: https://public-inbox.org/git/866052dkju@gmail.com/

Do I have that right?  If so, it looks all right to me.

Yes, the priority queue needs to compare via generation number first
or there will be errors. This is why we could not use commit time
before.

I was more concerned about getting right the order in the priority queue
(does it return minimal or maximal generation number).

I understand that the cutoff could not be used without generation
numbers because of the possibility of clock skew - using cutoff on dates
could lead to wrong results.


Maximal generation number is important so we do not visit commits 
multiple times (say, once with PARENT1 set, and a second time when 
PARENT2 is set). A minimal generation number order would create a DFS 
order and walk until the cutoff every time.


In cases without clock skew, maximal generation number order will walk 
the same set of commits as maximal commit time; the order may differ, 
but only between incomparable commits.



For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

[...]

flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
   -list = paint_down_to_common(one, n, twos);
+   list = paint_down_to_common(one, n, twos, 0);
while (list) {
struct commit *commit = pop_commit();
@@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return 0;
   -bases = paint_down_to_common(commit, nr_reference, reference);
+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);

Is it the only case where we would call paint_down_to_common() with
non-zero last parameter?  Would we always use commit->generation where
commit is the first parameter of paint_down_to_common()?

If both are true and will remain true, then in my humble opinion it is
not necessary to change the signature of this function.

We need to change the signature some way, but maybe the way I chose is
not the best.

No, after taking longer I think the new signature is a good choice.


To elaborate: paint_down_to_common() is used for multiple
purposes. The caller here that supplies 'commit->generation' is used
only to compute reachability (by testing if the flag PARENT2 exists on
the commit, then clears all flags). The other callers expect the full
walk down to the common commits, and keeps those PARENT1, PARENT2, and
STALE flags for future use (such as reporting merge bases). Usually
the call to paint_down_to_common() is followed by a revision walk that
only halts when reaching root commits or commits with both PARENT1 and
PARENT2 flags on, so always short-circuiting on generations would
break the functionality; this is confirmed by the
t5318-commit-graph.sh.

Right.

I have realized that just after sending the email.  I'm sorry about this.


An alternative to the signature change is to add a boolean parameter
"use_cutoff" or something, that specifies "don't walk beyond the
commit". This may give a more of a clear description of what it will
do with the generation value, but since we are 

Re: [PATCH v3 0/9] Compute and consume generation numbers

2018-04-23 Thread Derrick Stolee

On 4/18/2018 8:04 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


-- >8 --

This is the one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph) and lazy-loading trees
(ds/lazy-load-trees).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a short-circuit mechanism
to improve performance of `git branch --contains`.

Further, use generation numbers for 'git tag --contains', providing a
significant speedup (at least 95% for some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series is build on ds/lazy-load-trees.

Derrick Stolee (9):
   commit: add generation number to struct commmit

Nice and short patch. Looks good to me.


   commit-graph: compute generation numbers

Another quite easy to understand patch. LGTM.


   commit: use generations in paint_down_to_common()

Nice and short patch; minor typo in comment in code.
Otherwise it looks good to me.


   commit-graph.txt: update design document

I see that diagram got removed in this version; maybe it could be
replaced with relationship table?

Anyway, it looks good to me.


The diagrams and tables seemed to cause more confusion than clarity. I 
think the reader should create their own mental model from the 
definitions and description and we should avoid trying to make a summary.





   ref-filter: use generation number for --contains

A question: how performance looks like after the change if commit-graph
is not available?


The performance issue is minor, but will be fixed in v4.




   commit: use generation numbers for in_merge_bases()

Possible typo in the commit message, and stylistic inconsistence in
in_merge_bases() - though actually more clear than existing code.

Short, simple, and gives good performance improvenebts.


   commit: add short-circuit to paint_down_to_common()

Looks good to me; ignore [mostly] what I have written in response to the
patch in question.


   commit-graph: always load commit-graph information

Looks all right; question: parameter or one more global variable.


I responded to say that the global variable approach is incorrect. 
Parameter is important to functionality and performance.





   merge: check config before loading commits

This looks good to me.


  Documentation/technical/commit-graph.txt | 30 +--
  alloc.c  |  1 +
  builtin/merge.c  |  5 +-
  commit-graph.c   | 99 +++-
  commit-graph.h   |  8 ++
  commit.c | 54 +++--
  commit.h |  7 +-
  object.c |  2 +-
  ref-filter.c | 23 +-
  sha1_file.c  |  2 +-
  t/t5318-commit-graph.sh  |  9 +++
  11 files changed, 199 insertions(+), 41 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707




Re: [PATCH v3 8/9] commit-graph: always load commit-graph information

2018-04-23 Thread Derrick Stolee

On 4/18/2018 8:02 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

All right, that is nice explanation of the why behind this change.


Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter. This avoids duplicate work when we already
checked the graph in parse_commit_gently() or when simply checking the
buffer contents in check_commit().

Couldn't this 'check_graph' parameter be a global variable similar to
the 'commit_graph' variable?  Maybe I am not understanding it.


See the two callers at the bottom of the patch. They have different 
purposes: one needs to fill in a valid commit struct, the other needs to 
check the commit buffer is valid (then throws away the struct). They 
have different values for 'check_graph'. Also, in parse_commit_gently() 
we check parse_commit_in_graph() before we call parse_commit_buffer, so 
we do not want to repeat work; in the case of a valid commit-graph file, 
but the commit is not in the commit-graph, we would repeat our binary 
search for the same commit.





Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit-graph.c | 51 --
  commit-graph.h |  8 
  commit.c   |  7 +--
  commit.h   |  2 +-
  object.c   |  2 +-
  sha1_file.c|  2 +-
  6 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 688d5b1801..21e853c21a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
  }
  
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)

+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
  static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
  {
uint32_t edge_value;
uint32_t *parent_data_ptr;
uint64_t date_low, date_high;
struct commit_list **pptr;
-   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;

I'm probably wrong, but isn't it unrelated change?


You're right. I saw this while I was in here, and there was a similar 
comment on this change in a different patch. Probably best to keep these 
cleanup things in a separate commit.



item->object.parsed = 1;
item->graph_pos = pos;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
  }
  
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)

+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(commit_graph, &(item->object.oid), pos);
+   }
+}

All right (after the fix).


+
  int parse_commit_in_graph(struct commit *item)
  {
+   uint32_t pos;
+
+   if (item->object.parsed)
+   return 0;
if (!core_commit_graph)
return 0;
-   if (item->object.parsed)
-   return 1;

Hmmm... previously the function returned 1 if item->object.parsed, now
it returns 0 for this situation.  I don't understand this change.


The good news is that this change is unimportant (the only caller is 
parse_commit_gently() which checks item->object.parsed before calling 
parse_commit_in_graph()). I wonder why I reordered those things, anyway. 
I'll revert to simplify the patch.





-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, &(item->object.oid), 
);
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, po

Re: [PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()

2018-04-23 Thread Derrick Stolee

On 4/18/2018 7:19 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


[...]

[...], and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

If I understand the code properly, what happens is that we can now
short-circuit if all commits that are left are lower than the target
commit.

This is because max-order priority queue is used: if the commit with
maximum generation number is below generation number of target commit,
then target commit is not reachable from any commit in the priority
queue (all of which has generation number less or equal than the commit
at head of queue, i.e. all are same level or deeper); compare what I
have written in [1]

[1]: https://public-inbox.org/git/866052dkju@gmail.com/

Do I have that right?  If so, it looks all right to me.


Yes, the priority queue needs to compare via generation number first or 
there will be errors. This is why we could not use commit time before.





For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

[...]

flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
  
-	list = paint_down_to_common(one, n, twos);

+   list = paint_down_to_common(one, n, twos, 0);
  
  	while (list) {

struct commit *commit = pop_commit();
@@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return 0;
  
-	bases = paint_down_to_common(commit, nr_reference, reference);

+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);

Is it the only case where we would call paint_down_to_common() with
non-zero last parameter?  Would we always use commit->generation where
commit is the first parameter of paint_down_to_common()?

If both are true and will remain true, then in my humble opinion it is
not necessary to change the signature of this function.


We need to change the signature some way, but maybe the way I chose is 
not the best.


To elaborate: paint_down_to_common() is used for multiple purposes. The 
caller here that supplies 'commit->generation' is used only to compute 
reachability (by testing if the flag PARENT2 exists on the commit, then 
clears all flags). The other callers expect the full walk down to the 
common commits, and keeps those PARENT1, PARENT2, and STALE flags for 
future use (such as reporting merge bases). Usually the call to 
paint_down_to_common() is followed by a revision walk that only halts 
when reaching root commits or commits with both PARENT1 and PARENT2 
flags on, so always short-circuiting on generations would break the 
functionality; this is confirmed by the t5318-commit-graph.sh.


An alternative to the signature change is to add a boolean parameter 
"use_cutoff" or something, that specifies "don't walk beyond the 
commit". This may give a more of a clear description of what it will do 
with the generation value, but since we are already performing 
generation comparisons before calling paint_down_to_common() I find this 
simple enough.


Thanks,
-Stolee



Re: [PATCH v3 6/9] commit: use generation numbers for in_merge_bases()

2018-04-23 Thread Derrick Stolee

On 4/18/2018 6:15 PM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the target commits.

You have "target" twice in above paragraph; one of those should probably
be something else.


Thanks. Second "target" should be "initial".


[...]

+
+   if (commit->generation > min_generation)
+   return 0;

Why not use "return ret;" instead of "return 0;", like the rest of the
code [cryptically] does, that is:

   +if (commit->generation > min_generation)
   +return ret;


Sure.

Thanks,
-Stolee


Re: [PATCH v3 5/9] ref-filter: use generation number for --contains

2018-04-23 Thread Derrick Stolee

On 4/18/2018 5:02 PM, Jakub Narebski wrote:

Here I can offer only the cursory examination, as I don't know this area
of code in question.

Derrick Stolee <dsto...@microsoft.com> writes:


A commit A can reach a commit B only if the generation number of A
is larger than the generation number of B. This condition allows
significantly short-circuiting commit-graph walks.

Use generation number for 'git tag --contains' queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

A question: what is the performance after if the "commit-graph" feature
is disabled, or there is no commit-graph file?  Is there performance
regression in this case, or is the difference negligible?


Negligible, since we are adding a small number of integer comparisons 
and the main cost is in commit parsing. More on commit parsing in 
response to your comments below.





Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  ref-filter.c | 23 +++
  1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..e2fea6d635 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  /*
   * Test whether the candidate or one of its parents is contained in the list.

  ^

Sidenote: when examining the code after the change, I have noticed that
the above part of commit header for the comtains_test() function is no
longer entirely correct, as the function only checks the candidate
commit, and in no place it access its parents.

But that is not your problem.


I'll add a commit in the next version that fixes this comment before I 
make any changes to the method.





   * Do not recurse to find out, though, but return -1 if inconclusive.
   */
  static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
  {
enum contains_result *cached = contains_cache_at(cache, candidate);
  
@@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit *candidate,
  
  	/* Otherwise, we don't know; prepare to recurse */

parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+

Looks good to me.

The only [minor] question may be whether to define separate type for
generation numbers, and whether to future proof the tests - though the
latter would be almost certainly overengineering, and the former
probablt too.


If we have multiple notions of generation, then we can refactor all 
references to the "generation" member.





return CONTAINS_UNKNOWN;
  }
  
@@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct commit *candidate,

  struct contains_cache *cache)
  {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }

Sholdn't the above be made conditional on the ability to get generation
numbers from the commit-graph file (feature is turned on and file
exists)?  Otherwise here after the change contains_tag_algo() now parses
each commit in 'want', which I think was not done previously.

With commit-graph file parsing is [probably] cheap.  Without it, not
necessary.

But I might be worrying about nothing.


Not nothing. This parses the "wants" when we previously did not parse 
the wants. Further: this parsing happens before we do the simple check 
of comparing the OID of the candidate against the wants.


The question is: are these parsed commits significant compared to the 
walk that will parse many more commits? It is certainly possible.


One way to fix this is to call 'prepare_commit_graph()' directly and 
then test that 'commit_graph' is non-null before performing any parses. 
I'm not thrilled with how that couples the commit-graph implementation 
to this feature, but that may be necessary to avoid regressions in the 
non-commit-graph case.




  
+	result = contains_test(candidate, want, cache, cutoff);

O

Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-23 Thread Derrick Stolee

On 4/21/2018 4:44 PM, Jakub Narebski wrote:

Jakub Narebski <jna...@gmail.com> writes:

Derrick Stolee <sto...@gmail.com> writes:

On 4/11/2018 3:32 PM, Jakub Narebski wrote:

What would you suggest as a good test that could imply performance? The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.

The two main questions to consider are:

1. Can X reach Y?

That is easy to do.  The function generic_is_reachable() does
that... though using direct translation of the pseudocode for
"Algorithm 3: Reachable" from FELINE paper, which is recursive and
doesn't check if vertex was already visited was not good idea for large
graphs such as Linux kernel commit graph, oops.  That is why
generic_is_reachable_large() was created.

[...]


And the thing to measure is a commit count. If possible, it would be
good to count commits walked (commits whose parent list is enumerated)
and commits inspected (commits that were listed as a parent of some
walked commit). Walked commits require a commit parse -- albeit from
the commit-graph instead of the ODB now -- while inspected commits
only check the in-memory cache.

[...]

For git.git and Linux, I like to use the release tags as tests. They
provide a realistic view of the linear history, and maintenance
releases have their own history from the major releases.

Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
FELINE index does not bring any improvements over using just level
(generation number) filter.  But that may be caused by narrowing od
commit DAG around releases.

I try do do the same between commits in wide part, with many commits
with the same level (same generation number) both for source and for
target commit.  Though this may be unfair to level filter, though...


Note however that FELINE index is not unabiguous, like generation
numbers are (modulo decision whether to start at 0 or at 1); it depends
on the topological ordering chosen for the X elements.

One can now test reachability on git.git repository; there is a form
where one can plug source and destination revisions at
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK=2=1

I have tried the case that is quite unfair to the generation numbers
filter, namely the check between one of recent tags, and one commit that
shares generation number among largest number of other commits.

Here level = generation number-1 (as it starts at 0 for root commit, not
1).

The results are:
  * src = 468165c1d = v2.17.0
  * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec

  * 468165c1d has level 18418 which it shares with 6 commits
  * 66d2e04ec has level 14776 which it shares with 93 commits
  * gen(468165c1d) - gen(66d2e04ec) = 3642

  algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
  ---+-++--+-+--+---+
  naive  | 48865   | 39599  | 244  | 9200|  |   |
  level  |  3086   |  2492  | 113  |  528| 285  |   |
  FELINE |   283   |   216  |  68  |0|  |  25   |
  lev+FELINE |   282   |   215  |  68  |0|   5  |  24   |
  ---+-++--+-+--+---+
  lev+FEL+mpi|79   |59  |  21  |0|   0  |   0   |

Here we have:
* 'naive' implementation means simple DFS walk, without any filters (cut-offs)
* 'level' means using levels / generation numbers based negative-cut filter
* 'FELINE' means using FELINE index based negative-cut filter
* 'lev+FELINE' means combining generation numbers filter with FELINE filter
* 'mpi' means min-post [smanning-tree] intervals for positive-cut filter;
   note that the code does not walk the path after cut, but it is easy to do

The stats have the following meaning:
* 'access' means accessing the node
* 'walk' is actual walking the node
* 'maxdepth' is maximum depth of the stack used for DFS
* 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
   were used for negative-cut; note that those are not disjoint; node can
   be rejected by both level filter and FELINE filter

For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
the results are the same, with 5 commits accessed and 6 walked compared
to 61574 accessed in naive algorithm.

The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
(different child-less commits) and 9 roots, and has average clustering
coefficient 0.000409217.


Thanks for these results. Now, write a patch. I'm sticking to generation 
numbers for my patch because of the simplified computation, but you can 
contribute a FELINE implementation.



P.S. Would it be better to move the discussion about possible extensions
to the commit-graph in the form of new chunks (topological order, 

Re: [PATCH v10 00/36] Add directory rename detection to git

2018-04-19 Thread Derrick Stolee

On 4/19/2018 2:41 PM, Stefan Beller wrote:

On Thu, Apr 19, 2018 at 11:35 AM, Elijah Newren  wrote:

On Thu, Apr 19, 2018 at 10:57 AM, Elijah Newren  wrote:

This series is a reboot of the directory rename detection series that was
merged to master and then reverted due to the final patch having a buggy
can-skip-update check, as noted at
   https://public-inbox.org/git/xmqqmuya43cs@gitster-ct.c.googlers.com/
This series based on top of master.

...and merges cleanly to next but apparently has some minor conflicts
with both ds/lazy-load-trees and ps/test-chmtime-get from pu.

What's the preferred way to resolve this?  Rebase and resubmit my
series on pu, or something else?

If you were to base it off of pu, this series would depend on all other
series that pu contains. This is bad for the progress of this series.
(If it were to be merged to next, all other series would automatically
merge to next as well)

If the conflicts are minor, then Junio resolves them; if you want to be
nice, pick your merge point as

 git checkout origin/master
 git merge ds/lazy-load-trees
 git merge ps/test-chmtime-get
 git tag my-anchor

and put the series on top of that anchor.

If you do this, you'd want to be reasonably sure that
those two series are not in too much flux.


I believe ds/lazy-load-trees is queued for 'next'. I'm not surprised 
that there are some conflicts here. Any reference to the 'tree' member 
of a commit should be replaced with 'get_commit_tree(c)', or 
'get_commit_tree_oid(c)' if you only care about the tree's object id.


I think Stefan's suggestion is the best approach to get the right 
conflicts out of the way.


Thanks,
-Stolee


Re: [PATCH v3 3/9] commit: use generations in paint_down_to_common()

2018-04-18 Thread Derrick Stolee

On 4/18/2018 10:31 AM, Jakub Narebski wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Does it mean that it gives no measureable performance improvements for
typical test cases?


Not in this commit. When we add the `min_generation` parameter in a 
later commit, we do get a significant performance boost (when we can 
supply a non-zero value to `min_generation`).


This step of using generation numbers for the priority is important for 
that commit, but on its own has limited value outside of the clock-skew 
case mentioned above.





Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit.c | 20 +++-
  commit.h |  1 +
  2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..a44899c733 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, 
const void *b_,
return 0;
  }
  
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, void *unused)

+{
+   const struct commit *a = a_, *b = b_;
+
+   /* newer commits first */
+   if (a->generation < b->generation)
+   return 1;
+   else if (a->generation > b->generation)
+   return -1;
+
+   /* use date as a heuristic when generataions are equal */

Very minor typo in above comment:

s/generataions/generations/


Good catch!




+   if (a->date < b->date)
+   return 1;
+   else if (a->date > b->date)
+   return -1;
+   return 0;
+}
+
  int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused)
  {
const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
  /* all input commits in one and twos[] must have been parsed! */
  static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
  {
-   struct prio_queue queue = { compare_commits_by_commit_date };
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
  
diff --git a/commit.h b/commit.h

index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
  extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
  
  int compare_commits_by_commit_date(const void *a_, const void *b_, void *unused);

+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
  
  LAST_ARG_MUST_BE_NULL

  extern int run_commit_hook(int editor_is_used, const char *index_file, const 
char *name, ...);




Re: What's cooking in git.git (Apr 2018, #02; Tue, 17)

2018-04-18 Thread Derrick Stolee

On 4/17/2018 9:04 PM, Junio C Hamano wrote:

Stefan Beller <sbel...@google.com> writes:


  What's the doneness of this thing?  I didn't recall seeing any
  response, especially ones that demonstrated the reviewer carefully
  read and thought about the issues surrounding the code.  Not that I
  spotted any problems in these patches myself, though.

Stolee and Brandon provided a "quick LGTM" type of review
https://public-inbox.org/git/20180409232536.gb102...@google.com/
https://public-inbox.org/git/9ddfee7e-025a-79c9-8d6b-700c65a14...@gmail.com/

Yup.  Giving positive reviews is harder than giving constructive
criticism.  Much harder.

As readers cannot tell from a "quick LGTM" between "I didn't read it
but it did not smell foul" and "I read it thoroughly, understood how
the solution works, it was presented well, and agree with the design
and implementation---there is nothing to add", the reviewers need to
come up with some way to express that it is the latter case rather
than the former.

I would not claim that I've perfected my technique to do so, but
when responding to such a "good" series, I rephrase the main idea in
the series in my own words to show that I as a reviewer read the
series well enough to be able to do so, perhaps with comparison with
possible alternatives I could think of and dicussion to argue that
the solution presented in the series is better, in an attempt to
demonstrate that I am qualified to say "this one is good" with good
enough understanding of both the issue the series addresses and the
solution in the series.


I'm sorry that my second message was terse. My response to v1 [1] was

> I looked through these patches and only found one set of whitespace > 
errors. Compiles and tests fine on my machine. > > Reviewed-by: Derrick 
Stolee <dsto...@microsoft.com> So, I pulled the code, went through it 
patch-by-patch, and saw that the transformations were made using the 
established pattern. The second review was to chime in that my v1 
comments had been addressed. Thanks, -Stolee
[1] 
https://public-inbox.org/git/6c319100-df47-3b8d-8661-24e4643ad...@gmail.com/


Re: [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

2018-04-17 Thread Derrick Stolee

On 4/17/2018 2:10 PM, Derrick Stolee wrote:

The commit-graph feature is not useful to end users until the
commit-graph file is maintained automatically by Git during normal
upkeep operations. One natural place to trigger this write is during
'git gc'.

Before automatically generating a commit-graph, we need to be able to
verify the contents of a commit-graph file. Integrate commit-graph
checks into 'fsck' that check the commit-graph contents against commits
in the object database.

Things to think about:

* Are these the right integration points?

* gc.commitGraph defaults to true right now for the purpose of testing,
   but may not be required to start. The goal is to have this default to
   true eventually, but we may want to delay that until the feature is
   stable.

* I implement a "--reachable" option to 'git commit-graph write' that
   iterates over all refs. This does the same as

git show-ref -s | git commit-graph write --stdin-commits

   but I don't know how to pipe two child processes together inside of Git.
   Perhaps this is a better solution, anyway.

What other things should I be considering in this case? I'm unfamiliar
with the inner-workings of 'fsck' and 'gc', so this is a new space for me.

This RFC is based on v3 of ds/generation-numbers, and the first commit
is a fixup! based on a bug in that version that I caught while prepping
this series.

Thanks,
-Stolee

Derrick Stolee (12):
   fixup! commit-graph: always load commit-graph information
   commit-graph: add 'check' subcommand
   commit-graph: check file header information
   commit-graph: parse commit from chosen graph
   commit-graph: check fanout and lookup table
   commit: force commit to parse from object database
   commit-graph: load a root tree from specific graph
   commit-graph: verify commit contents against odb
   fsck: check commit-graph
   commit-graph: add '--reachable' option
   gc: automatically write commit-graph files
   commit-graph: update design document

  Documentation/git-commit-graph.txt   |  15 +-
  Documentation/git-gc.txt |   4 +
  Documentation/technical/commit-graph.txt |   9 --
  builtin/commit-graph.c   |  79 +-
  builtin/fsck.c   |  13 ++
  builtin/gc.c |   8 +
  commit-graph.c   | 178 ++-
  commit-graph.h   |   2 +
  commit.c |  14 +-
  commit.h |   1 +
  t/t5318-commit-graph.sh  |  15 ++
  11 files changed, 311 insertions(+), 27 deletions(-)


This RFC is also available as a GitHub pull request [1]

[1] https://github.com/derrickstolee/git/pull/6



[RFC PATCH 05/12] commit-graph: check fanout and lookup table

2018-04-17 Thread Derrick Stolee
While running 'git commit-graph check', verify that the object IDs
are listed in lexicographic order and that the fanout table correctly
navigates into that list of object IDs.

In anticipation of checking the commits in the commit-graph file
against the object database, parse the commits from that file in
advance. We perform this parse now to ensure the object cache contains
only commits from this commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 6d0d303a7a..6e3c08cd5c 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -835,6 +835,9 @@ static int check_commit_graph_error;
 
 int check_commit_graph(struct commit_graph *g)
 {
+   uint32_t i, cur_fanout_pos = 0;
+   struct object_id prev_oid, cur_oid;
+
if (!g) {
graph_report(_("no commit-graph file loaded"));
return 1;
@@ -859,5 +862,36 @@ int check_commit_graph(struct commit_graph *g)
if (g->hash_len != GRAPH_OID_LEN)
graph_report(_("commit-graph has incorrect hash length: %d"), 
g->hash_len);
 
+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   if (i > 0 && oidcmp(_oid, _oid) >= 0)
+   graph_report(_("commit-graph has incorrect oid order: 
%s then %s"),
+
+   oid_to_hex(_oid),
+   oid_to_hex(_oid));
+   oidcpy(_oid, _oid);
+
+   while (cur_oid.hash[0] > cur_fanout_pos) {
+   uint32_t fanout_value = get_be32(g->chunk_oid_fanout + 
cur_fanout_pos);
+   if (i != fanout_value)
+   graph_report(_("commit-graph has incorrect 
fanout value: fanout[%d] = %u != %u"),
+cur_fanout_pos, fanout_value, i);
+
+   cur_fanout_pos++;
+   }
+
+   graph_commit = lookup_commit(_oid);
+
+   if (!parse_commit_in_graph_one(g, graph_commit))
+   graph_report(_("failed to parse %s from commit-graph"), 
oid_to_hex(_oid));
+
+   if (graph_commit->graph_pos != i)
+   graph_report(_("graph_pos for commit %s is %u != %u"), 
oid_to_hex(_oid),
+graph_commit->graph_pos, i);
+   }
+
return check_commit_graph_error;
 }
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'

2018-04-17 Thread Derrick Stolee
The commit-graph feature is not useful to end users until the
commit-graph file is maintained automatically by Git during normal
upkeep operations. One natural place to trigger this write is during
'git gc'.

Before automatically generating a commit-graph, we need to be able to
verify the contents of a commit-graph file. Integrate commit-graph
checks into 'fsck' that check the commit-graph contents against commits
in the object database.

Things to think about:

* Are these the right integration points?

* gc.commitGraph defaults to true right now for the purpose of testing,
  but may not be required to start. The goal is to have this default to
  true eventually, but we may want to delay that until the feature is
  stable.

* I implement a "--reachable" option to 'git commit-graph write' that
  iterates over all refs. This does the same as 

git show-ref -s | git commit-graph write --stdin-commits

  but I don't know how to pipe two child processes together inside of Git.
  Perhaps this is a better solution, anyway.

What other things should I be considering in this case? I'm unfamiliar
with the inner-workings of 'fsck' and 'gc', so this is a new space for me.

This RFC is based on v3 of ds/generation-numbers, and the first commit
is a fixup! based on a bug in that version that I caught while prepping
this series.

Thanks,
-Stolee

Derrick Stolee (12):
  fixup! commit-graph: always load commit-graph information
  commit-graph: add 'check' subcommand
  commit-graph: check file header information
  commit-graph: parse commit from chosen graph
  commit-graph: check fanout and lookup table
  commit: force commit to parse from object database
  commit-graph: load a root tree from specific graph
  commit-graph: verify commit contents against odb
  fsck: check commit-graph
  commit-graph: add '--reachable' option
  gc: automatically write commit-graph files
  commit-graph: update design document

 Documentation/git-commit-graph.txt   |  15 +-
 Documentation/git-gc.txt |   4 +
 Documentation/technical/commit-graph.txt |   9 --
 builtin/commit-graph.c   |  79 +-
 builtin/fsck.c   |  13 ++
 builtin/gc.c |   8 +
 commit-graph.c   | 178 ++-
 commit-graph.h   |   2 +
 commit.c |  14 +-
 commit.h |   1 +
 t/t5318-commit-graph.sh  |  15 ++
 11 files changed, 311 insertions(+), 27 deletions(-)

-- 
2.17.0.39.g685157f7fb



[RFC PATCH 03/12] commit-graph: check file header information

2018-04-17 Thread Derrick Stolee
During a run of 'git commit-graph check', list the issues with the
header information in the commit-graph file. Some of this information
is inferred from the loaded 'struct commit_graph'.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index cd0634bba0..c5e5a0f860 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -820,7 +820,34 @@ void write_commit_graph(const char *obj_dir,
oids.nr = 0;
 }
 
+static int check_commit_graph_error;
+#define graph_report(...) { check_commit_graph_error = 1; printf(__VA_ARGS__); 
}
+
 int check_commit_graph(struct commit_graph *g)
 {
-   return !g;
+   if (!g) {
+   graph_report(_("no commit-graph file loaded"));
+   return 1;
+   }
+
+   check_commit_graph_error = 0;
+
+   if (get_be32(g->data) != GRAPH_SIGNATURE)
+   graph_report(_("commit-graph file has incorrect header"));
+
+   if (*(g->data + 4) != 1)
+   graph_report(_("commit-graph file version is not 1"));
+   if (*(g->data + 5) != GRAPH_OID_VERSION)
+   graph_report(_("commit-graph OID version is not 1 (SHA1)"));
+
+   if (!g->chunk_oid_fanout)
+   graph_report(_("commit-graph is missing the OID Fanout chunk"));
+   if (!g->chunk_oid_lookup)
+   graph_report(_("commit-graph is missing the OID Lookup chunk"));
+   if (!g->chunk_commit_data)
+   graph_report(_("commit-graph is missing the Commit Data 
chunk"));
+   if (g->hash_len != GRAPH_OID_LEN)
+   graph_report(_("commit-graph has incorrect hash length: %d"), 
g->hash_len);
+
+   return check_commit_graph_error;
 }
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 11/12] gc: automatically write commit-graph files

2018-04-17 Thread Derrick Stolee
The commit-graph file is a very helpful feature for speeding up git
operations. In order to make it more useful, write the commit-graph file
by default during standard garbage collection operations.

Add a 'gc.commitGraph' config setting that triggers writing a
commit-graph file after any non-trivial 'git gc' command.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-gc.txt | 4 
 builtin/gc.c | 8 
 2 files changed, 12 insertions(+)

diff --git a/Documentation/git-gc.txt b/Documentation/git-gc.txt
index 571b5a7e3c..17dd654a59 100644
--- a/Documentation/git-gc.txt
+++ b/Documentation/git-gc.txt
@@ -119,6 +119,10 @@ The optional configuration variable `gc.packRefs` 
determines if
 it within all non-bare repos or it can be set to a boolean value.
 This defaults to true.
 
+The optional configuration variable 'gc.commitGraph' determines if
+'git gc' runs 'git commit-graph write'. This can be set to a boolean
+value. This defaults to false.
+
 The optional configuration variable `gc.aggressiveWindow` controls how
 much time is spent optimizing the delta compression of the objects in
 the repository when the --aggressive option is specified.  The larger
diff --git a/builtin/gc.c b/builtin/gc.c
index 77fa720bd0..070f2a7a3d 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -34,6 +34,7 @@ static int aggressive_depth = 50;
 static int aggressive_window = 250;
 static int gc_auto_threshold = 6700;
 static int gc_auto_pack_limit = 50;
+static int gc_commit_graph = 1;
 static int detach_auto = 1;
 static timestamp_t gc_log_expire_time;
 static const char *gc_log_expire = "1.day.ago";
@@ -46,6 +47,7 @@ static struct argv_array repack = ARGV_ARRAY_INIT;
 static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
+static struct argv_array commit_graph = ARGV_ARRAY_INIT;
 
 static struct tempfile *pidfile;
 static struct lock_file log_lock;
@@ -121,6 +123,7 @@ static void gc_config(void)
git_config_get_int("gc.aggressivedepth", _depth);
git_config_get_int("gc.auto", _auto_threshold);
git_config_get_int("gc.autopacklimit", _auto_pack_limit);
+   git_config_get_bool("gc.commitgraph", _commit_graph);
git_config_get_bool("gc.autodetach", _auto);
git_config_get_expiry("gc.pruneexpire", _expire);
git_config_get_expiry("gc.worktreepruneexpire", 
_worktrees_expire);
@@ -374,6 +377,7 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
argv_array_pushl(, "prune", "--expire", NULL);
argv_array_pushl(_worktrees, "worktree", "prune", "--expire", 
NULL);
argv_array_pushl(, "rerere", "gc", NULL);
+   argv_array_pushl(_graph, "commit-graph", "write", "--reachable", 
NULL);
 
/* default expiry time, overwritten in gc_config */
gc_config();
@@ -480,6 +484,10 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
if (pack_garbage.nr > 0)
clean_pack_garbage();
 
+   if (gc_commit_graph)
+   if (run_command_v_opt(commit_graph.argv, RUN_GIT_CMD))
+   return error(FAILED_RUN, commit_graph.argv[0]);
+
if (auto_gc && too_many_loose_objects())
warning(_("There are too many unreachable loose objects; "
"run 'git prune' to remove them."));
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 02/12] commit-graph: add 'check' subcommand

2018-04-17 Thread Derrick Stolee
If the commit-graph file becomes corrupt, we need a way to verify
its contents match the object database. In the manner of 'git fsck'
we will implement a 'git commit-graph check' subcommand to report
all issues with the file.

Add the 'check' subcommand to the 'commit-graph' builtin and its
documentation. Add a simple test that ensures the command returns
a zero error code.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  7 +-
 builtin/commit-graph.c | 38 ++
 commit-graph.c |  5 
 commit-graph.h |  2 ++
 t/t5318-commit-graph.sh|  5 
 5 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 4c97b555cc..93c7841ba2 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -9,10 +9,10 @@ git-commit-graph - Write and verify Git commit graph files
 SYNOPSIS
 
 [verse]
+'git commit-graph check' [--object-dir ]
 'git commit-graph read' [--object-dir ]
 'git commit-graph write'  [--object-dir ]
 
-
 DESCRIPTION
 ---
 
@@ -52,6 +52,11 @@ existing commit-graph file.
 Read a graph file given by the commit-graph file and output basic
 details about the graph file. Used for debugging purposes.
 
+'check'::
+
+Read the commit-graph file and verify its contents against the object
+database. Used to check for corrupted data.
+
 
 EXAMPLES
 
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 37420ae0fd..77c1a04932 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -7,11 +7,17 @@
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
+   N_("git commit-graph check [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
NULL
 };
 
+static const char * const builtin_commit_graph_check_usage[] = {
+   N_("git commit-graph check [--object-dir ]"),
+   NULL
+};
+
 static const char * const builtin_commit_graph_read_usage[] = {
N_("git commit-graph read [--object-dir ]"),
NULL
@@ -29,6 +35,36 @@ static struct opts_commit_graph {
int append;
 } opts;
 
+
+static int graph_check(int argc, const char **argv)
+{
+   struct commit_graph *graph = 0;
+   char *graph_name;
+
+   static struct option builtin_commit_graph_check_options[] = {
+   OPT_STRING(0, "object-dir", _dir,
+  N_("dir"),
+  N_("The object directory to store the graph")),
+   OPT_END(),
+   };
+
+   argc = parse_options(argc, argv, NULL,
+builtin_commit_graph_check_options,
+builtin_commit_graph_check_usage, 0);
+
+   if (!opts.obj_dir)
+   opts.obj_dir = get_object_directory();
+
+   graph_name = get_commit_graph_filename(opts.obj_dir);
+   graph = load_commit_graph_one(graph_name);
+
+   if (!graph)
+   die("graph file %s does not exist", graph_name);
+   FREE_AND_NULL(graph_name);
+
+   return check_commit_graph(graph);
+}
+
 static int graph_read(int argc, const char **argv)
 {
struct commit_graph *graph = NULL;
@@ -160,6 +196,8 @@ int cmd_commit_graph(int argc, const char **argv, const 
char *prefix)
 PARSE_OPT_STOP_AT_NON_OPTION);
 
if (argc > 0) {
+   if (!strcmp(argv[0], "check"))
+   return graph_check(argc, argv);
if (!strcmp(argv[0], "read"))
return graph_read(argc, argv);
if (!strcmp(argv[0], "write"))
diff --git a/commit-graph.c b/commit-graph.c
index 3f0c142603..cd0634bba0 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -819,3 +819,8 @@ void write_commit_graph(const char *obj_dir,
oids.alloc = 0;
oids.nr = 0;
 }
+
+int check_commit_graph(struct commit_graph *g)
+{
+   return !g;
+}
diff --git a/commit-graph.h b/commit-graph.h
index 96cccb10f3..e8c8d99dff 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -53,4 +53,6 @@ void write_commit_graph(const char *obj_dir,
int nr_commits,
int append);
 
+int check_commit_graph(struct commit_graph *g);
+
 #endif
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index 77d85aefe7..e91053271a 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -230,4 +230,9 @@ test_expect_success 'perform fast-forward merge in full 
repo' '
test_cmp expect output
 '
 
+test_expect_success 'git commit-graph check' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git commit-graph check >output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 08/12] commit-graph: verify commit contents against odb

2018-04-17 Thread Derrick Stolee
When running 'git commit-graph check', compare the contents of the
commits that are loaded from the commit-graph file with commits that are
loaded directly from the object database. This includes checking the
root tree object ID, commit date, and parents.

In addition, verify the generation number calculation is correct for all
commits in the commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 82 ++
 1 file changed, 82 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 80a2ac2a6a..b5bce2bac4 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -899,5 +899,87 @@ int check_commit_graph(struct commit_graph *g)
 graph_commit->graph_pos, i);
}
 
+   for (i = 0; i < g->num_commits; i++) {
+   struct commit *graph_commit, *odb_commit;
+   struct commit_list *graph_parents, *odb_parents;
+   int num_parents = 0;
+
+   hashcpy(cur_oid.hash, g->chunk_oid_lookup + g->hash_len * i);
+
+   graph_commit = lookup_commit(_oid);
+   odb_commit = (struct commit *)create_object(cur_oid.hash, 
alloc_commit_node());
+   if (parse_commit_internal(odb_commit, 0, 0))
+   graph_report(_("failed to parse %s from object 
database"), oid_to_hex(_oid));
+
+   if (oidcmp(_commit_tree_in_graph_one(g, 
graph_commit)->object.oid,
+  get_commit_tree_oid(odb_commit)))
+   graph_report(_("root tree object ID for commit %s in 
commit-graph is %s != %s"),
+oid_to_hex(_oid),
+
oid_to_hex(get_commit_tree_oid(graph_commit)),
+
oid_to_hex(get_commit_tree_oid(odb_commit)));
+
+   if (graph_commit->date != odb_commit->date)
+   graph_report(_("commit date for commit %s in 
commit-graph is %"PRItime" != %"PRItime""),
+oid_to_hex(_oid),
+graph_commit->date,
+odb_commit->date);
+
+
+   graph_parents = graph_commit->parents;
+   odb_parents = odb_commit->parents;
+
+   while (graph_parents) {
+   num_parents++;
+
+   if (odb_parents == NULL)
+   graph_report(_("commit-graph parent list for 
commit %s is too long (%d)"),
+oid_to_hex(_oid),
+num_parents);
+
+   if (oidcmp(_parents->item->object.oid, 
_parents->item->object.oid))
+   graph_report(_("commit-graph parent for %s is 
%s != %s"),
+oid_to_hex(_oid),
+
oid_to_hex(_parents->item->object.oid),
+
oid_to_hex(_parents->item->object.oid));
+
+   graph_parents = graph_parents->next;
+   odb_parents = odb_parents->next;
+   }
+
+   if (odb_parents != NULL)
+   graph_report(_("commit-graph parent list for commit %s 
terminates early"),
+oid_to_hex(_oid));
+
+   if (graph_commit->generation) {
+   uint32_t max_generation = 0;
+   graph_parents = graph_commit->parents;
+
+   while (graph_parents) {
+   if (graph_parents->item->generation == 
GENERATION_NUMBER_ZERO ||
+   graph_parents->item->generation == 
GENERATION_NUMBER_INFINITY)
+   graph_report(_("commit-graph has valid 
generation for %s but not its parent, %s"),
+oid_to_hex(_oid),
+
oid_to_hex(_parents->item->object.oid));
+   if (graph_parents->item->generation > 
max_generation)
+   max_generation = 
graph_parents->item->generation;
+   graph_parents = graph_parents->next;
+   }
+
+   if (graph_commit->generation != max_generation + 1)
+   graph_report(_("commit-graph has incorrect 
generation for %s"),
+oid_to_hex(_oid));
+   } else {
+   graph_parents = graph_commit->parents;
+
+   

[RFC PATCH 09/12] fsck: check commit-graph

2018-04-17 Thread Derrick Stolee
If a commit-graph file exists, check its contents during 'git fsck'.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/fsck.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index ef78c6c00c..9712f230ba 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -16,6 +16,7 @@
 #include "streaming.h"
 #include "decorate.h"
 #include "packfile.h"
+#include "run-command.h"
 
 #define REACHABLE 0x0001
 #define SEEN  0x0002
@@ -45,6 +46,7 @@ static int name_objects;
 #define ERROR_REACHABLE 02
 #define ERROR_PACK 04
 #define ERROR_REFS 010
+#define ERROR_COMMIT_GRAPH 020
 
 static const char *describe_object(struct object *obj)
 {
@@ -815,5 +817,16 @@ int cmd_fsck(int argc, const char **argv, const char 
*prefix)
}
 
check_connectivity();
+
+   if (core_commit_graph) {
+   struct child_process commit_graph_check = CHILD_PROCESS_INIT;
+   const char *check_argv[] = { "commit-graph", "check", NULL, 
NULL };
+   commit_graph_check.argv = check_argv;
+   commit_graph_check.git_cmd = 1;
+
+   if (run_command(_graph_check))
+   errors_found |= ERROR_COMMIT_GRAPH;
+   }
+
return errors_found;
 }
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 06/12] commit: force commit to parse from object database

2018-04-17 Thread Derrick Stolee
In anticipation of checking commit-graph file contents against the
object database, create parse_commit_internal() to allow side-stepping
the commit-graph file and parse directly from the object database.

Due to the use of generation numbers, this method should not be called
unless the intention is explicit in avoiding commits from the
commit-graph file.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 14 ++
 commit.h |  1 +
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index 9ef6f699bd..07752d8503 100644
--- a/commit.c
+++ b/commit.c
@@ -392,7 +392,8 @@ int parse_commit_buffer(struct commit *item, const void 
*buffer, unsigned long s
return 0;
 }
 
-int parse_commit_gently(struct commit *item, int quiet_on_missing)
+
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph)
 {
enum object_type type;
void *buffer;
@@ -403,17 +404,17 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return -1;
if (item->object.parsed)
return 0;
-   if (parse_commit_in_graph(item))
+   if (use_commit_graph && parse_commit_in_graph(item))
return 0;
buffer = read_sha1_file(item->object.oid.hash, , );
if (!buffer)
return quiet_on_missing ? -1 :
error("Could not read %s",
-oid_to_hex(>object.oid));
+   oid_to_hex(>object.oid));
if (type != OBJ_COMMIT) {
free(buffer);
return error("Object %s not a commit",
-oid_to_hex(>object.oid));
+   oid_to_hex(>object.oid));
}
ret = parse_commit_buffer(item, buffer, size, 0);
if (save_commit_buffer && !ret) {
@@ -424,6 +425,11 @@ int parse_commit_gently(struct commit *item, int 
quiet_on_missing)
return ret;
 }
 
+int parse_commit_gently(struct commit *item, int quiet_on_missing)
+{
+   return parse_commit_internal(item, quiet_on_missing, 1);
+}
+
 void parse_commit_or_die(struct commit *item)
 {
if (parse_commit(item))
diff --git a/commit.h b/commit.h
index b5afde1ae9..5fde74fcd7 100644
--- a/commit.h
+++ b/commit.h
@@ -73,6 +73,7 @@ struct commit *lookup_commit_reference_by_name(const char 
*name);
 struct commit *lookup_commit_or_die(const struct object_id *oid, const char 
*ref_name);
 
 int parse_commit_buffer(struct commit *item, const void *buffer, unsigned long 
size, int check_graph);
+int parse_commit_internal(struct commit *item, int quiet_on_missing, int 
use_commit_graph);
 int parse_commit_gently(struct commit *item, int quiet_on_missing);
 static inline int parse_commit(struct commit *item)
 {
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 01/12] fixup! commit-graph: always load commit-graph information

2018-04-17 Thread Derrick Stolee
---
 commit-graph.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/commit-graph.c b/commit-graph.c
index 21e853c21a..3f0c142603 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -304,7 +304,7 @@ static int find_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
*pos = item->graph_pos;
return 1;
} else {
-   return bsearch_graph(commit_graph, &(item->object.oid), pos);
+   return bsearch_graph(g, &(item->object.oid), pos);
}
 }
 
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 10/12] commit-graph: add '--reachable' option

2018-04-17 Thread Derrick Stolee
When writing commit-graph files, it can be convenient to ask for all
reachable commits (starting at the ref set) in the resulting file. This
is particularly helpful when writing to stdin is complicated, such as a
future integration with 'git gc' which will call
'git commit-graph write --reachable' after performing cleanup of the
object database.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/git-commit-graph.txt |  8 --
 builtin/commit-graph.c | 41 +++---
 t/t5318-commit-graph.sh| 10 
 3 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/git-commit-graph.txt 
b/Documentation/git-commit-graph.txt
index 93c7841ba2..1b14d40590 100644
--- a/Documentation/git-commit-graph.txt
+++ b/Documentation/git-commit-graph.txt
@@ -37,12 +37,16 @@ Write a commit graph file based on the commits found in 
packfiles.
 +
 With the `--stdin-packs` option, generate the new commit graph by
 walking objects only in the specified pack-indexes. (Cannot be combined
-with --stdin-commits.)
+with --stdin-commits or --reachable.)
 +
 With the `--stdin-commits` option, generate the new commit graph by
 walking commits starting at the commits specified in stdin as a list
 of OIDs in hex, one OID per line. (Cannot be combined with
---stdin-packs.)
+--stdin-packs or --reachable.)
++
+With the `--reachable` option, generate the new commit graph by walking
+commits starting at all refs. (Cannot be combined with --stdin-commits
+or --stind-packs.)
 +
 With the `--append` option, include all commits that are present in the
 existing commit-graph file.
diff --git a/builtin/commit-graph.c b/builtin/commit-graph.c
index 77c1a04932..a89285ada8 100644
--- a/builtin/commit-graph.c
+++ b/builtin/commit-graph.c
@@ -3,13 +3,14 @@
 #include "dir.h"
 #include "lockfile.h"
 #include "parse-options.h"
+#include "refs.h"
 #include "commit-graph.h"
 
 static char const * const builtin_commit_graph_usage[] = {
N_("git commit-graph [--object-dir ]"),
N_("git commit-graph check [--object-dir ]"),
N_("git commit-graph read [--object-dir ]"),
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
@@ -24,12 +25,13 @@ static const char * const builtin_commit_graph_read_usage[] 
= {
 };
 
 static const char * const builtin_commit_graph_write_usage[] = {
-   N_("git commit-graph write [--object-dir ] [--append] 
[--stdin-packs|--stdin-commits]"),
+   N_("git commit-graph write [--object-dir ] [--append] 
[--reachable|--stdin-packs|--stdin-commits]"),
NULL
 };
 
 static struct opts_commit_graph {
const char *obj_dir;
+   int reachable;
int stdin_packs;
int stdin_commits;
int append;
@@ -113,6 +115,25 @@ static int graph_read(int argc, const char **argv)
return 0;
 }
 
+struct hex_list {
+   char **hex_strs;
+   int hex_nr;
+   int hex_alloc;
+};
+
+static int add_ref_to_list(const char *refname,
+  const struct object_id *oid,
+  int flags, void *cb_data)
+{
+   struct hex_list *list = (struct hex_list*)cb_data;
+
+   ALLOC_GROW(list->hex_strs, list->hex_nr + 1, list->hex_alloc);
+   list->hex_strs[list->hex_nr] = xcalloc(GIT_MAX_HEXSZ + 1, 1);
+   strcpy(list->hex_strs[list->hex_nr], oid_to_hex(oid));
+   list->hex_nr++;
+   return 0;
+}
+
 static int graph_write(int argc, const char **argv)
 {
const char **pack_indexes = NULL;
@@ -127,6 +148,8 @@ static int graph_write(int argc, const char **argv)
OPT_STRING(0, "object-dir", _dir,
N_("dir"),
N_("The object directory to store the graph")),
+   OPT_BOOL(0, "reachable", ,
+   N_("start walk at all refs")),
OPT_BOOL(0, "stdin-packs", _packs,
N_("scan pack-indexes listed by stdin for commits")),
OPT_BOOL(0, "stdin-commits", _commits,
@@ -140,8 +163,8 @@ static int graph_write(int argc, const char **argv)
 builtin_commit_graph_write_options,
 builtin_commit_graph_write_usage, 0);
 
-   if (opts.stdin_packs && opts.stdin_commits)
-   die(_("cannot use both --stdin-commits and --stdin-packs"));
+   if (opts.reachable + opts.stdin_packs + opts.stdin_commits > 1)
+   die(_("use at most one of --reachable, --stdin-commits, or 
--stdin-packs"));
if (!opts.obj_dir)
   

[RFC PATCH 07/12] commit-graph: load a root tree from specific graph

2018-04-17 Thread Derrick Stolee
When lazy-loading a tree for a commit, it will be important to select
the tree from a specific struct commit_graph. Create a new method that
specifies the commit-graph file and use that in
get_commit_tree_in_graph().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 6e3c08cd5c..80a2ac2a6a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -354,14 +354,20 @@ static struct tree *load_tree_for_commit(struct 
commit_graph *g, struct commit *
return c->maybe_tree;
 }
 
-struct tree *get_commit_tree_in_graph(const struct commit *c)
+static struct tree *get_commit_tree_in_graph_one(struct commit_graph *g,
+const struct commit *c)
 {
if (c->maybe_tree)
return c->maybe_tree;
if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
BUG("get_commit_tree_in_graph called from non-commit-graph 
commit");
 
-   return load_tree_for_commit(commit_graph, (struct commit *)c);
+   return load_tree_for_commit(g, (struct commit *)c);
+}
+
+struct tree *get_commit_tree_in_graph(const struct commit *c)
+{
+   return get_commit_tree_in_graph_one(commit_graph, c);
 }
 
 static void write_graph_chunk_fanout(struct hashfile *f,
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 12/12] commit-graph: update design document

2018-04-17 Thread Derrick Stolee
The commit-graph feature is now integrated with 'fsck' and 'gc',
so remove those items from the "Future Work" section of the
commit-graph design document.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 9 -
 1 file changed, 9 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index d9f2713efa..d04657b781 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -118,9 +118,6 @@ Future Work
 - The commit graph feature currently does not honor commit grafts. This can
   be remedied by duplicating or refactoring the current graft logic.
 
-- The 'commit-graph' subcommand does not have a "verify" mode that is
-  necessary for integration with fsck.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
@@ -142,12 +139,6 @@ Future Work
   such as "ensure_tree_loaded(commit)" that fully loads a tree before
   using commit->tree.
 
-- The current design uses the 'commit-graph' subcommand to generate the graph.
-  When this feature stabilizes enough to recommend to most users, we should
-  add automatic graph writes to common operations that create many commits.
-  For example, one could compute a graph on 'clone', 'fetch', or 'repack'
-  commands.
-
 - A server could provide a commit graph file as part of the network protocol
   to avoid extra calculations by clients. This feature is only of benefit if
   the user is willing to trust the file, because verifying the file is correct
-- 
2.17.0.39.g685157f7fb



[RFC PATCH 04/12] commit-graph: parse commit from chosen graph

2018-04-17 Thread Derrick Stolee
Before checking a commit-graph file against the object database, we
need to parse all commits from the given commit-graph file. Create
parse_commit_in_graph_one() to target a given struct commit_graph.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index c5e5a0f860..6d0d303a7a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -308,17 +308,27 @@ static int find_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
}
 }
 
-int parse_commit_in_graph(struct commit *item)
+int parse_commit_in_graph_one(struct commit_graph *g, struct commit *item)
 {
uint32_t pos;
 
if (item->object.parsed)
-   return 0;
+   return 1;
+
+   if (find_commit_in_graph(item, g, ))
+   return fill_commit_in_graph(item, g, pos);
+
+   return 0;
+}
+
+int parse_commit_in_graph(struct commit *item)
+{
if (!core_commit_graph)
return 0;
+
prepare_commit_graph();
-   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
-   return fill_commit_in_graph(item, commit_graph, pos);
+   if (commit_graph)
+   return parse_commit_in_graph_one(commit_graph, item);
return 0;
 }
 
-- 
2.17.0.39.g685157f7fb



Re: [PATCH v3 8/9] commit-graph: always load commit-graph information

2018-04-17 Thread Derrick Stolee

On 4/17/2018 1:00 PM, Derrick Stolee wrote:

Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter. This avoids duplicate work when we already
checked the graph in parse_commit_gently() or when simply checking the
buffer contents in check_commit().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
  commit-graph.c | 51 --
  commit-graph.h |  8 
  commit.c   |  7 +--
  commit.h   |  2 +-
  object.c   |  2 +-
  sha1_file.c|  2 +-
  6 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 688d5b1801..21e853c21a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
  }
  
+static void fill_commit_graph_info(struct commit *item, struct commit_graph *g, uint32_t pos)

+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
  static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
  {
uint32_t edge_value;
uint32_t *parent_data_ptr;
uint64_t date_low, date_high;
struct commit_list **pptr;
-   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
  
  	item->object.parsed = 1;

item->graph_pos = pos;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
  }
  
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, uint32_t *pos)

+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(commit_graph, &(item->object.oid), pos);


The reference to 'commit_graph' in the above line should be 'g'. Sorry!


+   }
+}
+
  int parse_commit_in_graph(struct commit *item)
  {
+   uint32_t pos;
+
+   if (item->object.parsed)
+   return 0;
if (!core_commit_graph)
return 0;
-   if (item->object.parsed)
-   return 1;
-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, &(item->object.oid), 
);
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, pos);
-   }
-
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   return fill_commit_in_graph(item, commit_graph, pos);
return 0;
  }
  
+void load_commit_graph_info(struct commit *item)

+{
+   uint32_t pos;
+   if (!core_commit_graph)
+   return;
+   prepare_commit_graph();
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   fill_commit_graph_info(item, commit_graph, pos);
+}
+
  static struct tree *load_tree_for_commit(struct commit_graph *g, struct 
commit *c)
  {
struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
   */
  int parse_commit_in_graph(struct commit *item);
  
+/*

+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
  struct tree *get_commit_tree_in_graph(const struct commit *c);
  
  struct commit_graph {

diff --git a/commit.c b/commit.c
index a70f120878..9ef6f699bd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_comm

[PATCH v3 9/9] merge: check config before loading commits

2018-04-17 Thread Derrick Stolee
Now that we use generation numbers from the commit-graph, we must
ensure that all commits that exist in the commit-graph are loaded
from that file instead of from the object database. Since the
commit-graph file is only checked if core.commitGraph is true, we
must check the default config before we load any commits.

In the merge builtin, the config was checked after loading the HEAD
commit. This was due to the use of the global 'branch' when checking
merge-specific config settings.

Move the config load to be between the initialization of 'branch' and
the commit lookup.

Without this change, a fast-forward merge would hit a BUG("bad
generation skip") statement in commit.c during paint_down_to_common().
This is because the HEAD commit would be loaded with "infinite"
generation but then reached by commits with "finite" generation
numbers.

Add a test to t5318-commit-graph.sh that exercises this code path to
prevent a regression.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 builtin/merge.c | 5 +++--
 t/t5318-commit-graph.sh | 9 +
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/builtin/merge.c b/builtin/merge.c
index 5e5e4497e3..7e1da6c6ea 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1148,13 +1148,14 @@ int cmd_merge(int argc, const char **argv, const char 
*prefix)
branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL);
if (branch)
skip_prefix(branch, "refs/heads/", );
+   init_diff_ui_defaults();
+   git_config(git_merge_config, NULL);
+
if (!branch || is_null_oid(_oid))
head_commit = NULL;
else
head_commit = lookup_commit_or_die(_oid, "HEAD");
 
-   init_diff_ui_defaults();
-   git_config(git_merge_config, NULL);
 
if (branch_mergeoptions)
parse_branch_merge_options(branch_mergeoptions);
diff --git a/t/t5318-commit-graph.sh b/t/t5318-commit-graph.sh
index a380419b65..77d85aefe7 100755
--- a/t/t5318-commit-graph.sh
+++ b/t/t5318-commit-graph.sh
@@ -221,4 +221,13 @@ test_expect_success 'write graph in bare repo' '
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 1' bare commits/8 
merge/1
 graph_git_behavior 'bare repo with graph, commit 8 vs merge 2' bare commits/8 
merge/2
 
+test_expect_success 'perform fast-forward merge in full repo' '
+   cd "$TRASH_DIRECTORY/full" &&
+   git checkout -b merge-5-to-8 commits/5 &&
+   git merge commits/8 &&
+   git show-ref -s merge-5-to-8 >output &&
+   git show-ref -s commits/8 >expect &&
+   test_cmp expect output
+'
+
 test_done
-- 
2.17.0.39.g685157f7fb



[PATCH v3 5/9] ref-filter: use generation number for --contains

2018-04-17 Thread Derrick Stolee
A commit A can reach a commit B only if the generation number of A
is larger than the generation number of B. This condition allows
significantly short-circuiting commit-graph walks.

Use generation number for 'git tag --contains' queries.

On a copy of the Linux repository where HEAD is containd in v4.13
but no earlier tag, the command 'git tag --contains HEAD' had the
following peformance improvement:

Before: 0.81s
After:  0.04s
Rel %:  -95%

Helped-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 ref-filter.c | 23 +++
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/ref-filter.c b/ref-filter.c
index cffd8bf3ce..e2fea6d635 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -1587,7 +1587,8 @@ static int in_commit_list(const struct commit_list *want, 
struct commit *c)
  */
 static enum contains_result contains_test(struct commit *candidate,
  const struct commit_list *want,
- struct contains_cache *cache)
+ struct contains_cache *cache,
+ uint32_t cutoff)
 {
enum contains_result *cached = contains_cache_at(cache, candidate);
 
@@ -1603,6 +1604,10 @@ static enum contains_result contains_test(struct commit 
*candidate,
 
/* Otherwise, we don't know; prepare to recurse */
parse_commit_or_die(candidate);
+
+   if (candidate->generation < cutoff)
+   return CONTAINS_NO;
+
return CONTAINS_UNKNOWN;
 }
 
@@ -1618,8 +1623,18 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
  struct contains_cache *cache)
 {
struct contains_stack contains_stack = { 0, 0, NULL };
-   enum contains_result result = contains_test(candidate, want, cache);
+   enum contains_result result;
+   uint32_t cutoff = GENERATION_NUMBER_INFINITY;
+   const struct commit_list *p;
+
+   for (p = want; p; p = p->next) {
+   struct commit *c = p->item;
+   parse_commit_or_die(c);
+   if (c->generation < cutoff)
+   cutoff = c->generation;
+   }
 
+   result = contains_test(candidate, want, cache, cutoff);
if (result != CONTAINS_UNKNOWN)
return result;
 
@@ -1637,7 +1652,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
 * If we just popped the stack, parents->item has been marked,
 * therefore contains_test will return a meaningful yes/no.
 */
-   else switch (contains_test(parents->item, want, cache)) {
+   else switch (contains_test(parents->item, want, cache, cutoff)) 
{
case CONTAINS_YES:
*contains_cache_at(cache, commit) = CONTAINS_YES;
contains_stack.nr--;
@@ -1651,7 +1666,7 @@ static enum contains_result contains_tag_algo(struct 
commit *candidate,
}
}
free(contains_stack.contains_stack);
-   return contains_test(candidate, want, cache);
+   return contains_test(candidate, want, cache, cutoff);
 }
 
 static int commit_contains(struct ref_filter *filter, struct commit *commit,
-- 
2.17.0.39.g685157f7fb



[PATCH v3 4/9] commit-graph.txt: update design document

2018-04-17 Thread Derrick Stolee
We now calculate generation numbers in the commit-graph file and use
them in paint_down_to_common().

Expand the section on generation numbers to discuss how the three
special generation numbers GENERATION_NUMBER_INFINITY, _ZERO, and
_MAX interact with other generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 Documentation/technical/commit-graph.txt | 30 +++-
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/Documentation/technical/commit-graph.txt 
b/Documentation/technical/commit-graph.txt
index 0550c6d0dc..d9f2713efa 100644
--- a/Documentation/technical/commit-graph.txt
+++ b/Documentation/technical/commit-graph.txt
@@ -77,6 +77,29 @@ in the commit graph. We can treat these commits as having 
"infinite"
 generation number and walk until reaching commits with known generation
 number.
 
+We use the macro GENERATION_NUMBER_INFINITY = 0x to mark commits not
+in the commit-graph file. If a commit-graph file was written by a version
+of Git that did not compute generation numbers, then those commits will
+have generation number represented by the macro GENERATION_NUMBER_ZERO = 0.
+
+Since the commit-graph file is closed under reachability, we can guarantee
+the following weaker condition on all commits:
+
+If A and B are commits with generation numbers N amd M, respectively,
+and N < M, then A cannot reach B.
+
+Note how the strict inequality differs from the inequality when we have
+fully-computed generation numbers. Using strict inequality may result in
+walking a few extra commits, but the simplicity in dealing with commits
+with generation number *_INFINITY or *_ZERO is valuable.
+
+We use the macro GENERATION_NUMBER_MAX = 0x3FFF to for commits whose
+generation numbers are computed to be at least this value. We limit at
+this value since it is the largest value that can be stored in the
+commit-graph file using the 30 bits available to generation numbers. This
+presents another case where a commit can have generation number equal to
+that of a parent.
+
 Design Details
 --
 
@@ -98,17 +121,12 @@ Future Work
 - The 'commit-graph' subcommand does not have a "verify" mode that is
   necessary for integration with fsck.
 
-- The file format includes room for precomputed generation numbers. These
-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
 - After computing and storing generation numbers, we must make graph
   walks aware of generation numbers to gain the performance benefits they
   enable. This will mostly be accomplished by swapping a commit-date-ordered
   priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:
 
-- paint_down_to_common()
 - 'log --topo-order'
 
 - Currently, parse_commit_gently() requires filling in the root tree
-- 
2.17.0.39.g685157f7fb



[PATCH v3 7/9] commit: add short-circuit to paint_down_to_common()

2018-04-17 Thread Derrick Stolee
When running 'git branch --contains', the in_merge_bases_many()
method calls paint_down_to_common() to discover if a specific
commit is reachable from a set of branches. Commits with lower
generation number are not needed to correctly answer the
containment query of in_merge_bases_many().

Add a new parameter, min_generation, to paint_down_to_common() that
prevents walking commits with generation number strictly less than
min_generation. If 0 is given, then there is no functional change.

For in_merge_bases_many(), we can pass commit->generation as the
cutoff, and this saves time during 'git branch --contains' queries
that would otherwise walk "around" the commit we are inspecting.

For a copy of the Linux repository, where HEAD is checked out at
v4.13~100, we get the following performance improvement for
'git branch --contains' over the previous commit:

Before: 0.21s
After:  0.13s
Rel %: -38%

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 18 ++
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/commit.c b/commit.c
index bceb79c419..a70f120878 100644
--- a/commit.c
+++ b/commit.c
@@ -805,11 +805,14 @@ static int queue_has_nonstale(struct prio_queue *queue)
 }
 
 /* all input commits in one and twos[] must have been parsed! */
-static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
+static struct commit_list *paint_down_to_common(struct commit *one, int n,
+   struct commit **twos,
+   int min_generation)
 {
struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
+   uint32_t last_gen = GENERATION_NUMBER_INFINITY;
 
one->object.flags |= PARENT1;
if (!n) {
@@ -828,6 +831,13 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
struct commit_list *parents;
int flags;
 
+   if (commit->generation > last_gen)
+   BUG("bad generation skip");
+   last_gen = commit->generation;
+
+   if (commit->generation < min_generation)
+   break;
+
flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -876,7 +886,7 @@ static struct commit_list *merge_bases_many(struct commit 
*one, int n, struct co
return NULL;
}
 
-   list = paint_down_to_common(one, n, twos);
+   list = paint_down_to_common(one, n, twos, 0);
 
while (list) {
struct commit *commit = pop_commit();
@@ -943,7 +953,7 @@ static int remove_redundant(struct commit **array, int cnt)
filled_index[filled] = j;
work[filled++] = array[j];
}
-   common = paint_down_to_common(array[i], filled, work);
+   common = paint_down_to_common(array[i], filled, work, 0);
if (array[i]->object.flags & PARENT2)
redundant[i] = 1;
for (j = 0; j < filled; j++)
@@ -1067,7 +1077,7 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
if (commit->generation > min_generation)
return 0;
 
-   bases = paint_down_to_common(commit, nr_reference, reference);
+   bases = paint_down_to_common(commit, nr_reference, reference, 
commit->generation);
if (commit->object.flags & PARENT2)
ret = 1;
clear_commit_marks(commit, all_flags);
-- 
2.17.0.39.g685157f7fb



[PATCH v3 2/9] commit-graph: compute generation numbers

2018-04-17 Thread Derrick Stolee
While preparing commits to be written into a commit-graph file, compute
the generation numbers using a depth-first strategy.

The only commits that are walked in this depth-first search are those
without a precomputed generation number. Thus, computation time will be
relative to the number of new commits to the commit-graph file.

If a computed generation number would exceed GENERATION_NUMBER_MAX, then
use GENERATION_NUMBER_MAX instead.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/commit-graph.c b/commit-graph.c
index 9ad21c3ffb..688d5b1801 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -439,6 +439,10 @@ static void write_graph_chunk_data(struct hashfile *f, int 
hash_len,
else
packedDate[0] = 0;
 
+   if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+   packedDate[0] |= htonl((*list)->generation << 2);
+   }
+
packedDate[1] = htonl((*list)->date);
hashwrite(f, packedDate, 8);
 
@@ -571,6 +575,46 @@ static void close_reachable(struct packed_oid_list *oids)
}
 }
 
+static void compute_generation_numbers(struct commit** commits,
+  int nr_commits)
+{
+   int i;
+   struct commit_list *list = NULL;
+
+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+   commits[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;
+
+   commit_list_insert(commits[i], );
+   while (list) {
+   struct commit *current = list->item;
+   struct commit_list *parent;
+   int all_parents_computed = 1;
+   uint32_t max_generation = 0;
+
+   for (parent = current->parents; parent; parent = 
parent->next) {
+   if (parent->item->generation == 
GENERATION_NUMBER_INFINITY ||
+   parent->item->generation == 
GENERATION_NUMBER_ZERO) {
+   all_parents_computed = 0;
+   commit_list_insert(parent->item, );
+   break;
+   } else if (parent->item->generation > 
max_generation) {
+   max_generation = 
parent->item->generation;
+   }
+   }
+
+   if (all_parents_computed) {
+   current->generation = max_generation + 1;
+   pop_commit();
+   }
+
+   if (current->generation > GENERATION_NUMBER_MAX)
+   current->generation = GENERATION_NUMBER_MAX;
+   }
+   }
+}
+
 void write_commit_graph(const char *obj_dir,
const char **pack_indexes,
int nr_packs,
@@ -694,6 +738,8 @@ void write_commit_graph(const char *obj_dir,
if (commits.nr >= GRAPH_PARENT_MISSING)
die(_("too many commits to write graph"));
 
+   compute_generation_numbers(commits.list, commits.nr);
+
graph_name = get_commit_graph_filename(obj_dir);
fd = hold_lock_file_for_update(, graph_name, 0);
 
-- 
2.17.0.39.g685157f7fb



[PATCH v3 8/9] commit-graph: always load commit-graph information

2018-04-17 Thread Derrick Stolee
Most code paths load commits using lookup_commit() and then
parse_commit(). In some cases, including some branch lookups, the commit
is parsed using parse_object_buffer() which side-steps parse_commit() in
favor of parse_commit_buffer().

With generation numbers in the commit-graph, we need to ensure that any
commit that exists in the commit-graph file has its generation number
loaded.

Create new load_commit_graph_info() method to fill in the information
for a commit that exists only in the commit-graph file. Call it from
parse_commit_buffer() after loading the other commit information from
the given buffer. Only fill this information when specified by the
'check_graph' parameter. This avoids duplicate work when we already
checked the graph in parse_commit_gently() or when simply checking the
buffer contents in check_commit().

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit-graph.c | 51 --
 commit-graph.h |  8 
 commit.c   |  7 +--
 commit.h   |  2 +-
 object.c   |  2 +-
 sha1_file.c|  2 +-
 6 files changed, 49 insertions(+), 23 deletions(-)

diff --git a/commit-graph.c b/commit-graph.c
index 688d5b1801..21e853c21a 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -245,13 +245,19 @@ static struct commit_list **insert_parent_or_die(struct 
commit_graph *g,
return _list_insert(c, pptr)->next;
 }
 
+static void fill_commit_graph_info(struct commit *item, struct commit_graph 
*g, uint32_t pos)
+{
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+}
+
 static int fill_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t pos)
 {
uint32_t edge_value;
uint32_t *parent_data_ptr;
uint64_t date_low, date_high;
struct commit_list **pptr;
-   const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len 
+ 16) * pos;
+   const unsigned char *commit_data = g->chunk_commit_data + 
GRAPH_DATA_WIDTH * pos;
 
item->object.parsed = 1;
item->graph_pos = pos;
@@ -292,31 +298,40 @@ static int fill_commit_in_graph(struct commit *item, 
struct commit_graph *g, uin
return 1;
 }
 
+static int find_commit_in_graph(struct commit *item, struct commit_graph *g, 
uint32_t *pos)
+{
+   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
+   *pos = item->graph_pos;
+   return 1;
+   } else {
+   return bsearch_graph(commit_graph, &(item->object.oid), pos);
+   }
+}
+
 int parse_commit_in_graph(struct commit *item)
 {
+   uint32_t pos;
+
+   if (item->object.parsed)
+   return 0;
if (!core_commit_graph)
return 0;
-   if (item->object.parsed)
-   return 1;
-
prepare_commit_graph();
-   if (commit_graph) {
-   uint32_t pos;
-   int found;
-   if (item->graph_pos != COMMIT_NOT_FROM_GRAPH) {
-   pos = item->graph_pos;
-   found = 1;
-   } else {
-   found = bsearch_graph(commit_graph, 
&(item->object.oid), );
-   }
-
-   if (found)
-   return fill_commit_in_graph(item, commit_graph, pos);
-   }
-
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   return fill_commit_in_graph(item, commit_graph, pos);
return 0;
 }
 
+void load_commit_graph_info(struct commit *item)
+{
+   uint32_t pos;
+   if (!core_commit_graph)
+   return;
+   prepare_commit_graph();
+   if (commit_graph && find_commit_in_graph(item, commit_graph, ))
+   fill_commit_graph_info(item, commit_graph, pos);
+}
+
 static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit 
*c)
 {
struct object_id oid;
diff --git a/commit-graph.h b/commit-graph.h
index 260a468e73..96cccb10f3 100644
--- a/commit-graph.h
+++ b/commit-graph.h
@@ -17,6 +17,14 @@ char *get_commit_graph_filename(const char *obj_dir);
  */
 int parse_commit_in_graph(struct commit *item);
 
+/*
+ * It is possible that we loaded commit contents from the commit buffer,
+ * but we also want to ensure the commit-graph content is correctly
+ * checked and filled. Fill the graph_pos and generation members of
+ * the given commit.
+ */
+void load_commit_graph_info(struct commit *item);
+
 struct tree *get_commit_tree_in_graph(const struct commit *c);
 
 struct commit_graph {
diff --git a/commit.c b/commit.c
index a70f120878..9ef6f699bd 100644
--- a/commit.c
+++ b/commit.c
@@ -331,7 +331,7 @@ const void *detach_commit_buffer(struct commit *commit, 
unsigned long *sizep)
return ret;
 }
 
-int parse_commit_buffer(struct commit *item, const void *buffer

[PATCH v3 1/9] commit: add generation number to struct commmit

2018-04-17 Thread Derrick Stolee
The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
  more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use three special values to signal
the generation number is invalid:

GENERATION_NUMBER_INFINITY 0x
GENERATION_NUMBER_MAX 0x3FFF
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_MAX) means the generation number is too large to
store in the commit-graph file. The third (_ZERO) means the generation
number was loaded from a commit graph file that was written by a version
of git that did not support generation numbers.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 alloc.c| 1 +
 commit-graph.c | 2 ++
 commit.h   | 4 
 3 files changed, 7 insertions(+)

diff --git a/alloc.c b/alloc.c
index cf4f8b61e1..e8ab14f4a1 100644
--- a/alloc.c
+++ b/alloc.c
@@ -94,6 +94,7 @@ void *alloc_commit_node(void)
c->object.type = OBJ_COMMIT;
c->index = alloc_commit_index();
c->graph_pos = COMMIT_NOT_FROM_GRAPH;
+   c->generation = GENERATION_NUMBER_INFINITY;
return c;
 }
 
diff --git a/commit-graph.c b/commit-graph.c
index 70fa1b25fd..9ad21c3ffb 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -262,6 +262,8 @@ static int fill_commit_in_graph(struct commit *item, struct 
commit_graph *g, uin
date_low = get_be32(commit_data + g->hash_len + 12);
item->date = (timestamp_t)((date_high << 32) | date_low);
 
+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;
+
pptr = >parents;
 
edge_value = get_be32(commit_data + g->hash_len);
diff --git a/commit.h b/commit.h
index 23a3f364ed..aac3b8c56f 100644
--- a/commit.h
+++ b/commit.h
@@ -10,6 +10,9 @@
 #include "pretty.h"
 
 #define COMMIT_NOT_FROM_GRAPH 0x
+#define GENERATION_NUMBER_INFINITY 0x
+#define GENERATION_NUMBER_MAX 0x3FFF
+#define GENERATION_NUMBER_ZERO 0
 
 struct commit_list {
struct commit *item;
@@ -30,6 +33,7 @@ struct commit {
 */
struct tree *maybe_tree;
uint32_t graph_pos;
+   uint32_t generation;
 };
 
 extern int save_commit_buffer;
-- 
2.17.0.39.g685157f7fb



[PATCH v3 6/9] commit: use generation numbers for in_merge_bases()

2018-04-17 Thread Derrick Stolee
The containment algorithm for 'git branch --contains' is different
from that for 'git tag --contains' in that it uses is_descendant_of()
instead of contains_tag_algo(). The expensive portion of the branch
algorithm is computing merge bases.

When a commit-graph file exists with generation numbers computed,
we can avoid this merge-base calculation when the target commit has
a larger generation number than the target commits.

Performance tests were run on a copy of the Linux repository where
HEAD is contained in v4.13 but no earlier tag. Also, all tags were
copied to branches and 'git branch --contains' was tested:

Before: 60.0s
After:   0.4s
Rel %: -99.3%

Reported-by: Jeff King <p...@peff.net>
Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index a44899c733..bceb79c419 100644
--- a/commit.c
+++ b/commit.c
@@ -1053,12 +1053,19 @@ int in_merge_bases_many(struct commit *commit, int 
nr_reference, struct commit *
 {
struct commit_list *bases;
int ret = 0, i;
+   uint32_t min_generation = GENERATION_NUMBER_INFINITY;
 
if (parse_commit(commit))
return ret;
-   for (i = 0; i < nr_reference; i++)
+   for (i = 0; i < nr_reference; i++) {
if (parse_commit(reference[i]))
return ret;
+   if (min_generation > reference[i]->generation)
+   min_generation = reference[i]->generation;
+   }
+
+   if (commit->generation > min_generation)
+   return 0;
 
bases = paint_down_to_common(commit, nr_reference, reference);
if (commit->object.flags & PARENT2)
-- 
2.17.0.39.g685157f7fb



[PATCH v3 0/9] Compute and consume generation numbers

2018-04-17 Thread Derrick Stolee
Thanks for all the help on v2. Here are a few changes between versions:

* Removed the constant-time check in queue_has_nonstale() due to the
  possibility of a performance hit and no evidence of a performance
  benefit in typical cases.

* Reordered the commits about loading commits from the commit-graph.
  This way it is easier to demonstrate the incorrect checks. On my
  machine, every commit compiles and the test suite passes, but patches
  6-8 have the bug that is fixed in patch 9 "merge: check config before
  loading commits".

* The interaction with parse_commit_in_graph() from parse_object() is
  replaced with a new 'check_graph' parameter in parse_commit_buffer().
  This allows us to fill in the graph_pos and generation values for
  commits that are parsed directly from a buffer. This keeps the existing
  behavior that a commit parsed this way should match its buffer.

* There was discussion about making GENERATION_NUMBER_MAX assignable by
  an environment variable so we could add tests that exercise the behavior
  of capping a generation at that value. Perhaps the code around this is
  simple enough that we do not need to add that complexity.

Thanks,
-Stolee

-- >8 --

This is the one of several "small" patches that follow the serialized
Git commit graph patch (ds/commit-graph) and lazy-loading trees
(ds/lazy-load-trees).

As described in Documentation/technical/commit-graph.txt, the generation
number of a commit is one more than the maximum generation number among
its parents (trivially, a commit with no parents has generation number
one). This section is expanded to describe the interaction with special
generation numbers GENERATION_NUMBER_INFINITY (commits not in the commit-graph
file) and *_ZERO (commits in a commit-graph file written before generation
numbers were implemented).

This series makes the computation of generation numbers part of the
commit-graph write process.

Finally, generation numbers are used to order commits in the priority
queue in paint_down_to_common(). This allows a short-circuit mechanism
to improve performance of `git branch --contains`.

Further, use generation numbers for 'git tag --contains), providing a
significant speedup (at least 95% for some cases).

A more substantial refactoring of revision.c is required before making
'git log --graph' use generation numbers effectively.

This patch series is build on ds/lazy-load-trees.

Derrick Stolee (9):
  commit: add generation number to struct commmit
  commit-graph: compute generation numbers
  commit: use generations in paint_down_to_common()
  commit-graph.txt: update design document
  ref-filter: use generation number for --contains
  commit: use generation numbers for in_merge_bases()
  commit: add short-circuit to paint_down_to_common()
  commit-graph: always load commit-graph information
  merge: check config before loading commits

 Documentation/technical/commit-graph.txt | 30 +--
 alloc.c  |  1 +
 builtin/merge.c  |  5 +-
 commit-graph.c   | 99 +++-
 commit-graph.h   |  8 ++
 commit.c | 54 +++--
 commit.h |  7 +-
 object.c |  2 +-
 ref-filter.c | 23 +-
 sha1_file.c  |  2 +-
 t/t5318-commit-graph.sh  |  9 +++
 11 files changed, 199 insertions(+), 41 deletions(-)


base-commit: 7b8a21dba1bce44d64bd86427d3d92437adc4707
-- 
2.17.0.39.g685157f7fb



[PATCH v3 3/9] commit: use generations in paint_down_to_common()

2018-04-17 Thread Derrick Stolee
Define compare_commits_by_gen_then_commit_date(), which uses generation
numbers as a primary comparison and commit date to break ties (or as a
comparison when both commits do not have computed generation numbers).

Since the commit-graph file is closed under reachability, we know that
all commits in the file have generation at most GENERATION_NUMBER_MAX
which is less than GENERATION_NUMBER_INFINITY.

This change does not affect the number of commits that are walked during
the execution of paint_down_to_common(), only the order that those
commits are inspected. In the case that commit dates violate topological
order (i.e. a parent is "newer" than a child), the previous code could
walk a commit twice: if a commit is reached with the PARENT1 bit, but
later is re-visited with the PARENT2 bit, then that PARENT2 bit must be
propagated to its parents. Using generation numbers avoids this extra
effort, even if it is somewhat rare.

Signed-off-by: Derrick Stolee <dsto...@microsoft.com>
---
 commit.c | 20 +++-
 commit.h |  1 +
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/commit.c b/commit.c
index 711f674c18..a44899c733 100644
--- a/commit.c
+++ b/commit.c
@@ -640,6 +640,24 @@ static int compare_commits_by_author_date(const void *a_, 
const void *b_,
return 0;
 }
 
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused)
+{
+   const struct commit *a = a_, *b = b_;
+
+   /* newer commits first */
+   if (a->generation < b->generation)
+   return 1;
+   else if (a->generation > b->generation)
+   return -1;
+
+   /* use date as a heuristic when generataions are equal */
+   if (a->date < b->date)
+   return 1;
+   else if (a->date > b->date)
+   return -1;
+   return 0;
+}
+
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused)
 {
const struct commit *a = a_, *b = b_;
@@ -789,7 +807,7 @@ static int queue_has_nonstale(struct prio_queue *queue)
 /* all input commits in one and twos[] must have been parsed! */
 static struct commit_list *paint_down_to_common(struct commit *one, int n, 
struct commit **twos)
 {
-   struct prio_queue queue = { compare_commits_by_commit_date };
+   struct prio_queue queue = { compare_commits_by_gen_then_commit_date };
struct commit_list *result = NULL;
int i;
 
diff --git a/commit.h b/commit.h
index aac3b8c56f..64436ff44e 100644
--- a/commit.h
+++ b/commit.h
@@ -341,6 +341,7 @@ extern int remove_signature(struct strbuf *buf);
 extern int check_commit_signature(const struct commit *commit, struct 
signature_check *sigc);
 
 int compare_commits_by_commit_date(const void *a_, const void *b_, void 
*unused);
+int compare_commits_by_gen_then_commit_date(const void *a_, const void *b_, 
void *unused);
 
 LAST_ARG_MUST_BE_NULL
 extern int run_commit_hook(int editor_is_used, const char *index_file, const 
char *name, ...);
-- 
2.17.0.39.g685157f7fb



Re: [PATCHv3 00/15] replace_object.c: rename to use dash in file name

2018-04-12 Thread Derrick Stolee

On 4/11/2018 8:21 PM, Stefan Beller wrote:

v3:
* interdiff below,
   the only changes are renaming the variable
   -   struct ref_store *main_ref_store;
   +   struct ref_store *refs;
   in struct repository.
   as well as dropping the file rename patch.
* improved commit messages from discussion on the single patches.

Looks good to me. Thanks!
-Stolee


Re: [PATCH v2 07/10] commit-graph.txt: update future work

2018-04-12 Thread Derrick Stolee

On 4/12/2018 5:12 AM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


+Here is a diagram to visualize the shape of the full commit graph, and
+how different generation numbers relate:
+
++-+
+| GENERATION_NUMBER_INFINITY = 0x |
++-+
+   ||  ^
+   ||  |
+   |+--+
+   | [gen(A) = gen(B)]
+   V
++-+
+| 0 < commit->generation < 0x4000 |
++-+
+   ||  ^
+   ||  |
+   |+--+
+   |[gen(A) > gen(B)]
+   V
++-+
+| GENERATION_NUMBER_ZERO = 0  |
++-+
+|  ^
+|  |
++--+
+[gen(A) = gen(B)]

It may be just me but all I can read out of the above is that
commit->generation may store 0x, a value between 0 and
0x4000, or 0.  I cannot quite tell what the notation [gen(A)
 gen(B)] is trying to say.  I am guessing "Two generation
numbers within the 'valid' range can be compared" is what the second
one is trying to say, but it is much less interesting to know that
two infinities compare equal than how generation numbers from
different classes compare, which cannot be depicted in the above
notation, I am afraid.  For example, don't we want to say that a
commit with INF can never be reached by a commit with a valid
generation number, or something like that?


My intention with the arrows was to demonstrate where parent 
relationships can go, and the generation-number relation between a 
commit A with parent B. Clearly, this diagram is less than helpful.





  Design Details
  --
  
@@ -98,17 +141,12 @@ Future Work

  - The 'commit-graph' subcommand does not have a "verify" mode that is
necessary for integration with fsck.
  
-- The file format includes room for precomputed generation numbers. These

-  are not currently computed, so all generation numbers will be marked as
-  0 (or "uncomputed"). A later patch will include this calculation.
-
  - After computing and storing generation numbers, we must make graph
walks aware of generation numbers to gain the performance benefits they
enable. This will mostly be accomplished by swapping a commit-date-ordered
priority queue with one ordered by generation number. The following
-  operations are important candidates:
+  operation is an important candidate:

Good.




Re: [PATCH v8 03/14] commit-graph: add format document

2018-04-12 Thread Derrick Stolee

On 4/11/2018 4:58 PM, Jakub Narebski wrote:

Derrick Stolee <sto...@gmail.com> writes:


+CHUNK DATA:
+
+  OID Fanout (ID: {'O', 'I', 'D', 'F'}) (256 * 4 bytes)
+  The ith entry, F[i], stores the number of OIDs with first
+  byte at most i. Thus F[255] stores the total
+  number of commits (N).
+
+  OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
+  The OIDs for all commits in the graph, sorted in ascending order.
+
+  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)

I think it is a typo, and it should be CDAT, not CGET
(CDAT seem to me to stand for Commit DATa):

   +  Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes)

This is what you use in actual implementation, in PATCH v8 06/14

DS> +#define GRAPH_SIGNATURE 0x43475048 /* "CGPH" */
DS> +#define GRAPH_CHUNKID_OIDFANOUT 0x4f494446 /* "OIDF" */
DS> +#define GRAPH_CHUNKID_OIDLOOKUP 0x4f49444c /* "OIDL" */
DS> +#define GRAPH_CHUNKID_DATA 0x43444154 /* "CDAT" */
DS> +#define GRAPH_CHUNKID_LARGEEDGES 0x45444745 /* "EDGE" */



Documentation bugs are hard to diagnose. Thanks for finding this. I 
double checked that the hex int "0x43444154" matches "CDAT".


Here is a diff to make it match.

diff --git a/Documentation/technical/commit-graph-format.txt 
b/Documentation/technical/commit-graph-format.txt

index ad6af8105c..af03501834 100644
--- a/Documentation/technical/commit-graph-format.txt
+++ b/Documentation/technical/commit-graph-format.txt
@@ -70,7 +70,7 @@ CHUNK DATA:
   OID Lookup (ID: {'O', 'I', 'D', 'L'}) (N * H bytes)
   The OIDs for all commits in the graph, sorted in ascending order.

-  Commit Data (ID: {'C', 'G', 'E', 'T' }) (N * (H + 16) bytes)
+  Commit Data (ID: {'C', 'D', 'A', 'T' }) (N * (H + 16) bytes)
 * The first H bytes are for the OID of the root tree.
 * The next 8 bytes are for the positions of the first two parents
   of the ith commit. Stores value 0x if no parent in that



Re: [PATCH resend] SubmittingPatches: mention the git contacts command

2018-04-11 Thread Derrick Stolee

On 4/11/2018 4:20 PM, Thomas Gummerer wrote:

Instead of just mentioning 'git blame' and 'git shortlog', which make it
quite hard for new contributors to pick out the appropriate list of
people to cc on their patch series, mention the 'git contacts' utility,
which makes it much easier to get a reasonable list of contacts for a
change.

This should help new contributors pick out a reasonable cc list by
simply using a single command.

Signed-off-by: Thomas Gummerer 
---

I've originally sent this at <20180316213323.GC2224@hank>, during an
the rc period.  Eric had some comments, which I interpreted as being
okay with the change (hope I'm not mistaken there :)).  As I still
think this would be an improvement for new contributors, I'm resending
it here.


I didn't know about this tool, and it seems helpful. I plan to use it 
now. Thanks!



  Documentation/SubmittingPatches | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index a1d0feca36..945f8edb46 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -260,8 +260,8 @@ that starts with `-BEGIN PGP SIGNED MESSAGE-`.  
That is
  not a text/plain, it's something else.
  
  Send your patch with "To:" set to the mailing list, with "cc:" listing

-people who are involved in the area you are touching (the output from
-`git blame $path` and `git shortlog --no-merges $path` would help to
+people who are involved in the area you are touching (the `git
+contacts` command in `contrib/contacts/` can help to
  identify them), to solicit comments and reviews.
  
  :1: footnote:[The current maintainer: gits...@pobox.com]




Re: [PATCH 0/6] Compute and consume generation numbers

2018-04-11 Thread Derrick Stolee

On 4/11/2018 3:32 PM, Jakub Narebski wrote:

What would you suggest as a good test that could imply performance? The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.


The two main questions to consider are:

1. Can X reach Y?
2. What is the set of merge-bases between X and Y?

And the thing to measure is a commit count. If possible, it would be 
good to count commits walked (commits whose parent list is enumerated) 
and commits inspected (commits that were listed as a parent of some 
walked commit). Walked commits require a commit parse -- albeit from the 
commit-graph instead of the ODB now -- while inspected commits only 
check the in-memory cache.


For git.git and Linux, I like to use the release tags as tests. They 
provide a realistic view of the linear history, and maintenance releases 
have their own history from the major releases.



I have tried finding number of false positives for level (generation
number) filter and for FELINE index, and number of false negatives for
min-post intervals in the spanning tree (for DFS tree) for 1
randomly selected pairs of commits... but I don't think this is a good
benchmark.


What is a false-positive? A case where gen(X) < gen(Y) but Y cannot 
reach X? I do not think that is a great benchmark, but I guess it is 
something to measure.



I Linux kernel sources 
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git)
that has 750832 nodes and 811733 edges, and 563747941392 possible
directed pairs, we have for 1 randomly selected pairs of commits:

   level-filter has91 =  0.91% [all] false positives
   FELINE index has78 =  0.78% [all] false positives
   FELINE index has 1.16667 less false positives than level filter

   min-post spanning-tree intervals has  3641 = 36.41% [all] false
   negatives


Perhaps something you can do instead of sampling from N^2 commits in 
total is to select a pair of generations (say, G = 2, G' = 20100) or 
regions of generations ( 2 <= G <= 20050, 20100 <= G' <= 20150) and 
see how many false positives you see by testing all pairs (one from each 
level). The delta between the generations may need to be smaller to 
actually have a large proportion of unreachable pairs. Try different 
levels, since major version releases tend to "pinch" the commit graph to 
a common history.



For git.git repository (https://github.com/git/git.git) that has 52950
nodes and 65887 edges the numbers are slighly more in FELINE index
favor (also out of 1 random pairs):

   level-filter has   504 =  9.11% false positives
   FELINE index has   125 =  2.26% false positives
   FELINE index has 4.032 less false positives than level filter

This is for FELINE which does not use level / generatio-numbers filter.


Thanks,
-Stolee



Re: [PATCH v2 06/10] commit.c: use generation to halt paint walk

2018-04-11 Thread Derrick Stolee

On 4/10/2018 11:02 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


@@ -800,17 +810,26 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
return result;
}
prio_queue_put(, one);
+   if (one->generation < min_nonstale_gen)
+   min_nonstale_gen = one->generation;
  
  	for (i = 0; i < n; i++) {

twos[i]->object.flags |= PARENT2;
prio_queue_put(, twos[i]);
+   if (twos[i]->generation < min_nonstale_gen)
+   min_nonstale_gen = twos[i]->generation;
}
  
-	while (queue_has_nonstale()) {

+   while (queue_has_nonstale(, min_nonstale_gen)) {
struct commit *commit = prio_queue_get();
struct commit_list *parents;
int flags;
  
+		if (commit->generation > last_gen)

+   BUG("bad generation skip");
+
+   last_gen = commit->generation;
+
flags = commit->object.flags & (PARENT1 | PARENT2 | STALE);
if (flags == (PARENT1 | PARENT2)) {
if (!(commit->object.flags & RESULT)) {
@@ -830,6 +849,10 @@ static struct commit_list *paint_down_to_common(struct 
commit *one, int n, struc
return NULL;
p->object.flags |= flags;

Hmph.  Can a commit that used to be not stale (and contributed to
the current value of min_nonstale_gen) become stale here by getting
visited twice, invalidating the value in min_nonstale_gen?


min_nonstale_gen can be "wrong" in the way you say, but fits the 
definition from the commit message:


"To properly take advantage of this condition, track the minimum 
generation number of a commit that **enters the queue** with nonstale 
status." (Emphasis added)


You make an excellent point about how this can be problematic. I was 
confused by the lack of clear performance benefits here, but I think 
that whatever benefits making queue_has_nonstale() be O(1) were removed 
by walking more commits than necessary.


Consider the following commit graph, where M is a parent of both A and 
B, S is a parent of M and B, and there is a large set of commits 
reachable from M with generation number larger than gen(S).


A    B
| __/|
|/   |
M    |
|\   |
. |  |
. |  |
. |_/
|/
S

Between A and B, the true merge base is M. Anything reachable from M is 
marked as stale. When S is added to the queue, it is only reachable from 
B, so it is non-stale. However, it is marked stale after M is walked. 
The old code would detect this as a termination condition, but the new 
code would not.


I think this data shape is actually common (not exactly, as it may be 
that some ancestor of M provides a second path to S) especially in the 
world of pull requests and users merging master into their topic branches.


I'll remove this commit in the next version, but use the new prototype 
for queue_has_nonstale() in "commit: add short-circuit to 
paint_down_to_common()" using the given 'min_generation' instead of 
'min_nonstale_gen'.


Thanks,
-Stolee


Re: [PATCH v2 04/10] commit-graph: compute generation numbers

2018-04-11 Thread Derrick Stolee

On 4/10/2018 10:51 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


+   if ((*list)->generation != GENERATION_NUMBER_INFINITY) {
+   if ((*list)->generation > GENERATION_NUMBER_MAX)
+   die("generation number %u is too large to store in 
commit-graph",
+   (*list)->generation);
+   packedDate[0] |= htonl((*list)->generation << 2);
+   }


How serious do we want this feature to be?  On one extreme, we could
be irresponsible and say it will be a problem for our descendants in
the future if their repositories have more than billion pearls on a
single strand, and the above certainly is a reasonable way to punt.
Those who actually encounter the problem will notice by Git dying
somewhere rather deep in the callchain.

Or we could say Git actually does support a history that is
arbitrarily long, even though such a deep portion of history will
not benefit from having generation numbers in commit-graph.

I've been assuming that our stance is the latter and that is why I
made noises about overflowing 30-bit generation field in my review
of the previous step.

In case we want to do the "we know this is very large, but we do not
know the exact value", we may actually want a mode where we can
pretend that GENERATION_NUMBER_MAX is set to quite low (say 256) and
make sure that the code to handle overflow behaves sensibly.


I agree. I wonder how we can effectively expose this value into a test. 
It's probably not sufficient to manually test using compiler flags ("-D 
GENERATION_NUMBER_MAX=8").





+   for (i = 0; i < nr_commits; i++) {
+   if (commits[i]->generation != GENERATION_NUMBER_INFINITY &&
+   commits[i]->generation != GENERATION_NUMBER_ZERO)
+   continue;
+
+   commit_list_insert(commits[i], );
+   while (list) {
+...
+   }
+   }

So we go over the list of commits just _once_ and make sure each of
them gets the generation assigned correctly by (conceptually
recursively but iteratively in implementation by using a commit
list) making sure that all its parents have generation assigned and
compute the generation for the commit, before moving to the next
one.  Which sounds correct.


Yes, we compute the generation number of a commit exactly once. We use 
the list as a stack so we do not have recursion limits during our 
depth-first search (DFS). We rely on the object cache to ensure we store 
the computed generation numbers, and computed generation numbers provide 
termination conditions to the DFS.


Thanks,
-Stolee


Re: [PATCH v2 03/10] commit: add generation number to struct commmit

2018-04-11 Thread Derrick Stolee

On 4/10/2018 10:31 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


The generation number of a commit is defined recursively as follows:

* If a commit A has no parents, then the generation number of A is one.
* If a commit A has parents, then the generation number of A is one
   more than the maximum generation number among the parents of A.

Add a uint32_t generation field to struct commit so we can pass this
information to revision walks. We use two special values to signal
the generation number is invalid:

GENERATION_NUMBER_ININITY 0x
GENERATION_NUMBER_ZERO 0

The first (_INFINITY) means the generation number has not been loaded or
computed. The second (_ZERO) means the generation number was loaded
from a commit graph file that was stored before generation numbers
were computed.

Should it also be possible for a caller to tell if a given commit
has too deep a history, i.e. we do not know its generation number
exactly, but we know it is larger than 1<<30?

It seems that we only have a 30-bit field in the file, so wouldn't
we need a special value defined in (e.g. "0") so that we can tell
that the commit has such a large generation number?  E.g.


+   item->generation = get_be32(commit_data + g->hash_len + 8) >> 2;

if (!item->generation)
item->generation = GENERATION_NUMBER_OVERFLOW;

when we read it from the file?

We obviously need to do something similar when assigning a
generation number to a child commit, perhaps like

#define GENERATION_NUMBER_OVERFLOW (GENERATION_NUMBER_MAX + 1)

commit->generation = 1; /* assume no parent */
for (p = commit->parents; p; p++) {
uint32_t gen = p->item->generation + 1;

if (gen >= GENERATION_NUMBER_OVERFLOW) {
commit->generation = GENERATION_NUMBER_OVERFLOW;
break;
} else if (commit->generation < gen)
commit->generation = gen;
}
 
or something?  And then on the writing side you'd encode too large a

generation as '0'.


You raise a very good point. How about we do a slightly different 
arrangement for these overflow commits?


Instead of storing the commits in the commit-graph file as "0" (which 
currently means "written by a version of git that did not compute 
generation numbers") we could let GENERATION_NUMBER_MAX be the maximum 
generation of a commit in the commit-graph, and if a commit would have 
larger generation, we collapse it down to that value.


It slightly complicates the diagram I made in 
Documentation/technical/commit-graph.txt, but it was already a bit of a 
simplification. Here is an updated diagram, but likely we will want to 
limit discussion of the special-case GENERATION_NUMBER_MAX to the prose, 
since it is not a practical situation at the moment.


    +-+
    | GENERATION_NUMBER_INFINITY = 0x |
    +-+
  |    |    |  ^
  |    |    |  |
  |    |    +--+
  |    | [gen(A) = gen(B)]
  |    V
  |  ++
      |  | GENERATION_NUMBER_MAX = 0x3FFF |
  |  ++
  |    |    |  ^
  |    |    |  |
  |    |    +--+
  |    | [gen(A) = gen(B)]
  V    V
    +-+
    | 0 < commit->generation < 0x3FFF |
    +-+
        |    |  ^
        |    |  |
        |    +--+
        |    [gen(A) > gen(B)]
        V
    +-+
    | GENERATION_NUMBER_ZERO = 0  |
    +-+
             |  ^
             |  |
             +--+
         [gen(A) = gen(B)]

Thanks,
-Stolee


Re: [PATCH v2 02/10] merge: check config before loading commits

2018-04-11 Thread Derrick Stolee

On 4/10/2018 10:12 PM, Junio C Hamano wrote:

Derrick Stolee <dsto...@microsoft.com> writes:


diff --git a/builtin/merge.c b/builtin/merge.c
index ee050a47f3..20897f8223 100644
--- a/builtin/merge.c
+++ b/builtin/merge.c
@@ -1183,13 +1183,14 @@ int cmd_merge(int argc, const char **argv, const char 
*prefix)
branch = branch_to_free = resolve_refdup("HEAD", 0, _oid, NULL);
if (branch)
skip_prefix(branch, "refs/heads/", );
+   init_diff_ui_defaults();
+   git_config(git_merge_config, NULL);
+
if (!branch || is_null_oid(_oid))
head_commit = NULL;
else
head_commit = lookup_commit_or_die(_oid, "HEAD");
  
-	init_diff_ui_defaults();

-   git_config(git_merge_config, NULL);

Wow, that's tricky.  git_merge_config() wants to know which "branch"
we are on, and this place is as early as we can move the call to
without breaking things.  Is this to allow parse_object() called
in lookup_commit_reference_gently() to know if we can rely on the
data cached in the commit-graph data?


When I saw the bug on my machine, I tracked the issue down to a call to 
parse_commit_in_graph() that skipped the graph check since 
core_commit_graph was not set. The call stack from this call is as follows:


* lookup_commit_or_die()
* lookup_commit_reference()
* lookup_commit_reference_gently()
* parse_object()
* parse_object_buffer()
* parse_commit_in_graph() [as introduced in PATCH 01/10]




Move the config load to be between the initialization of 'branch'
and the commit lookup. Also add a test to t5318-commit-graph.sh
that exercises this code path to prevent a regression.

It is not clear to me how a successful merge of commits/8
demonstrates that reading the config earlier than before is
regression free.


I didn't want to introduce commits in an order that led to a commit 
failing tests, but if you drop the change to builtin/merge.c from this 
series, the tip commit will fail this test with "BUG: bad generation skip".


The reason for this failure is that commits/5 is loaded from HEAD from 
the object database, so its generation is marked as 
GENERATION_NUMBER_INFINITY, and the commit is marked as parsed. Later, 
the commit at merges/3 is loaded from the graph with generation 4. This 
triggers the BUG statement in paint_down_to_common(). That is why it is 
important to check a fast-forward merge.


In the 'graph_git_behavior' steps of t5318-commit-graph.sh, we were 
already testing 'git merge-base' to check the commit walk logic.


Thanks,
-Stolee


<    5   6   7   8   9   10   11   12   13   14   >