Re: [RFC] Generation Number v2

2018-11-03 Thread Jakub Narebski
Jakub Narebski  writes:
> Jakub Narebski  writes:
>> Stefan Beller  writes:
> [...]
>>> How would this impact creation of a commit?
>>>
>>> The current generation numbers can be lazily updated or not
>>> updated at all. In my understanding of the maximum generation
>>> numbers, a new commit would make these maximum generation
>>> numbers invalid (i.e. they have to be recomputed).
> [...]
>>> For the V2 maximum generation numbers, would we need to
>>> rewrite the numbers for all commits once we recompute them?
>>> Assuming that is true, it sounds like the benchmark doesn't
>>> cover the whole costs associated with V2, which is why the
>>> exceptional performance can be explained.
>>
>> Let's check it using a simple example
>>
>> First, (minimum) parent-based generation numbers before and after
>> extending the commit graph:
>>
>>   1   2 3   4 5   6   7new
>>   1   2 3   4 5   -   -old
>>   .---.-.---.-.---*---*
>>\
>> \   3   4 5   6new
>>  \  3   4 5   6old
>>   \-.---.-.---.
>>  \
>>   \   5new
>>\  -old
>> \-*
>
> Let me check yet another idea, using (minimum) parent-based V0 generation
> numbers (counting distance from the sink / root) as a starting number
> for source / heads commits.
[...]

> [...] but let's check another example
>
>1   2 3   4   5   6 7   8   9   new
>1   2 3   4   5   6 7   8   -   old
>.---.-.---.---.---.-.---.---*
> \ /
>  \   3   4   / 5   6   7   8   new
>   \  5   6  /  -   -   -   -   old
>\-.---.-/---*---*---*---*

But let's do this correctly.


   1   2 3   4  5   6 7   8   9  new
   1   2 3   4  5   6 7   8   -  old
   .---.-.---.--.---.-.---.---*
\/
 \   3   4  /new
  \  5   6 / old
   \-.---./
  \
   \5   6 7   8  new
\   -   - -   -  old
 \--*---*-*---*

Well, it looks as if I draw it incorrectly, but performed calculations
right.  You may need to modify / change some data, but it looks as if it
is not that much of a problem.

The new version of the maximum generation numbers looks like it gives
the same results as generation numbers for the "longest" path, and
update may affect only the side-branches that were added to.  All
branches merged into the trunk, and not added to should be safe with
respect to updating.

Can anyone here prove a thing about update of those modified maximum
generation numbers?  Thanks in advance.

Best,
-- 
Jakub Narębski


Re: [RFC] Generation Number v2

2018-11-03 Thread Jakub Narebski
Derrick Stolee  writes:
> On 11/1/2018 8:27 AM, Jakub Narebski wrote:
>> Derrick Stolee  writes:
>>
>>> Please also let me know about any additional tests that I could
>>> run. Now that I've got a lot of test scripts built up, I can re-run
>>> the test suite pretty quickly.
>>
>> I would add straighforward 1-to-N and N-to-1 reachability tests in the
>> form of `git branch / tag --contains` and `git branch / tag --merged`,
>> and perhaps also ahead-behind calculations used by `git status`, and
>> N-to-M reachability tests used by tag following code in push and fetch.

1-to-N and N-to-1 can be done with `git branch --merged` / `--contains`.

>> Possibly also A...B walk, if it is not implemented via calculating
>> merge-base.
>
> I believe this uses paint_down_to_common(), but looks at the PARENT1 and
> PARENT2 flags afterward to determine the left/right/both relationships.

Right, so I guess this is the same internal mechanism that `git
merge-base` utilizes, just used a bit differently.  Thus benchmarking
`git merge-base` should cover also A...B performance, I guess.

>> Maybe even --ancestry-path walk, though I am not sure how important best
>> performance for rhis case is (we would want good performance, but
>> perhaps best is not needed for rarely used command).
>
> Currently, the implementation of --ancestry-path does not use a
> reachability index.

Well, using reachability index would certainly give it a boost.

[...]
>>> Generation Number Performance Test
>>> ==
>>>
>>> Git uses the commit history to answer many basic questions, such as
>>> computing a merge-base. Some of these algorithms can benefit from a
>>> _reachability index_, which is a condition that allows us to answer
>>> "commit A cannot reach commit B" in some cases. These require pre-
>>> computing some values that we can load and use at run-time. Git
>>> already has a notion of _generation number_, stores it in the commit-
>>> graph file, and uses it in several reachability algorithms.
>>
>> Note that there are other kinds of reachability indices.
>>
>> First, there are reachability indices that can answer the full
>> reachability query (if commit A can reach commit B, or if commit A
>> cannot reach commit B) directly, without walking the commit graph at
>> all: so called label-only approach.  For example one could store for
>> each commit the compressed list of all commits reachable from it
>> (transitive closure compression).
>>
>> Those, I think (but I have not checked), would be of not much use for
>> Git, as the size of the index grows stronger than linear with the number
>> of commits, as grows the time to compute such index.  So probably of no
>> use to Git, at least not directly (Git uses so called "bitmap index",
>> see e.g. [1], which stores reachability bit-vector as compressed
>> bitmap... but not for all commits, only for a small subset).
>>
>>
>> Second, beside negative-cut reachability indexes, that can answer with
>> certainity that "commit A cannot reach commit B", like the generation
>> numbers (also known as level, or topological level), there also
>> positive-cut indexes (usually if not always based on the spanning tree
>> or trees for the DAG), that can answer when "commit A can reach commit
>> B".
>>
>> I think that those can lead to significant speedups for at least some
>> types of commit traversal and reachability queries that Git needs to
>> answer.  But currently no algorithms has a provision for using
>> positive-cut filter index.  Anyway, such index would be auxiliary thing,
>> see the next point.
>>
>>
>> Third, we require more than having reachability index in the sense of
>> having some kind of _labelling_, often composite, that can answer either
>> "commit A cannot reach commit B" or "commit A can reach commit B",
>> depending on the type.  Git for most operations needs more, namely an
>> actual _ordering_, that is a weak order (or to be more exact a total
>> preorder, i.e. there can be two different commits with the same
>> "generation number" or index, but always either idx(A) ≲ idx(B) or
>> idx(B) ≲ idx(A)) and not only partial ordering (where there can be
>> incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)).
>> This is because it needs to be used in priority queue to decide which
>> commit to travel next; more on that later.  This is also why there can
>> be only one such "main" reachability _index_.
>>
>> [1]: https://githubengineering.com/counting-objects/
>
> Thanks for the details. At the moment, I'm only interested in improving our
> negative-cut reachability index, as we have algorithms that can take
> advantage of them (and can compare their performance directly).

Right, I can agree with that.  Positive-cut reachability index use would
need to be added separately.

What I wanted to emphasize is the need for a _number_ (or a total
preorder), not simply an _index_ (a label, perhaps a composite one).
This means, as I wrote, that there would 

Re: [RFC] Generation Number v2

2018-11-02 Thread Jakub Narebski
Derrick Stolee  writes:
> On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote:
>> On Tue, Oct 30 2018, Junio C Hamano wrote:
>>> Derrick Stolee  writes:

 In contrast, maximum generation numbers and corrected commit
 dates both performed quite well. They are frequently the top
 two performing indexes, and rarely significantly different.

 The trade-off here now seems to be: which _property_ is more important,
 locally-computable or backwards-compatible?
>>>
>>> Nice summary.
>>>
>>> As I already said, I personally do not think being compatible with
>>> currently deployed clients is important at all (primarily because I
>>> still consider the whole thing experimental), and there is a clear
>>> way forward once we correct the mistake of not having a version
>>> number in the file format that tells the updated clients to ignore
>>> the generation numbers.  For longer term viability, we should pick
>>> something that is immutable, reproducible, computable with minimum
>>> input---all of which would lead to being incrementally computable, I
>>> would think.
>>
>> I think it depends on what we mean by backwards compatibility. None of
>> our docs are saying this is experimental right now, just that it's
>> opt-in like so many other git-config(1) options.
>>
>> So if we mean breaking backwards compatibility in that we'll write a new
>> file or clobber the existing one with a version older clients can't use
>> as an optimization, fine.
>>
>> But it would be bad to produce a hard error on older clients, but
>> avoiding that seems as easy as just creating a "commit-graph2" file in
>> .git/objects/info/.
>
> Well, we have a 1-byte version number following the "CGPH" header in
> the commit-graph file, and clients will ignore the commit-graph file
> if that number is not "1". My hope for backwards-compatibility was
> to avoid incrementing this value and instead use the unused 8th byte.

How?  Some of the considered new generation numbers were backwards
compatibile in the sense that for graph traversal the old code for
(minimum) generation numbers could be used.  But that is for reading.
If the old client tried to update new generation number using old
generation number algorithm (assuming locality), it would write some
chimaera that may or may not work.

> However, it appears that we are destined to increment that version
> number, anyway. Here is my list for what needs to be in the next
> version of the commit-graph file format:

As a reminder, here is how the commit-graph format looks now [1]

  HEADER:
4-byte signature:  CGPH
1-byte version number: 1
1-byte Hash Version:   1 = SHA-1
1-byte number (C) of "chunks": 3 or 4
1-byte (reserved for later use)
  CHUNK LOOKUP:
(C + 1) * 12 bytes listing the table of contents for the chunks
  CHUNK DATA:
OIDF
OIDL
CDAT
EDGE [optional, depends on graph structure]
  TRAILER:
checksum

[1]: Documentation/technical/commit-graph-format.txt


> 1. A four-byte hash version.

Why do you need a four-byte hash version?  255 possible hash functions
is not enough?

> 2. File incrementality (split commit-graph).

This would probably require a segment lookup part, and one chunk of each
type per segment (or even one chunk lookup table per segment).  It may
or may not need generation number which can support segmenting (here
maximal generation numbers have the advantage); but you can assume that
if segment(A) > segment(B) then reach.ix(A) > reach.ix(B),
i.e. lexicographical-like sort.

> 3. Reachability Index versioning

I assume that you mean here versioning for the reachability index that
is stored in the CDAT section (if it is in a separate chunk, it should
be easy to version).

The easiest thing to do when encountering unknown or not supported
generation number (perhaps the commit-graph file was created in the past
by earlier version of Git, perahs it came from server with newer Git, or
perhaps the same repository is sometimes accessed using newer and
sometimes older Git version), is to set commit->generation_number to
GENERATION_NUMBER_ZERO, as you wrote in [2].

We could encode backward-compatibility (for using) somewhat, but I
wonder how much it would be of use.

[2]: 
https://public-inbox.org/git/61a829ce-0d29-81c9-880e-7aef1bec9...@gmail.com/

> Most of these changes will happen in the file header. The chunks
> themselves don't need to change, but some chunks may be added that
> only make sense in v2 commit-graphs.

What I would like to see in v2 commit-graph format would be some kind
convention for naming / categorizing chunks, so that Git would be able
to handle unknown chunks gracefully.

In PNG format there are various types of chunks.  Quoting Wikipedia:

  Chunk types [in PNG format] are given a four-letter case sensitive
  ASCII type/name; compare FourCC. The case of the different letters in
  the name (bit 5 of the numeric value of the character) is a bit 

Re: [RFC] Generation Number v2

2018-11-02 Thread Jakub Narebski
Derrick Stolee  writes:
> On 10/29/2018 11:59 PM, Junio C Hamano wrote:
>> Derrick Stolee  writes:
[...]
>>> * **Compatible?** In our test implementation, we use a previously unused
>>>byte of data in the commit-graph format to indicate which reachability
>>>index version we are using. Existing clients ignore this value, so we
>>>will want to consider if these new indexes are _backwards compatible_.
>>>That is, will they still report correct values if they ignore this byte
>>>and use the generation number column from the commit-graph file assuming
>>>the values are minimum generation numbers?
>>
>> I personally consider that the current commit-graph with generation
>> numbers experimental, so I am not sure how much we care about this.
>>
>> Having said that.
>>
>> By the above definition, any new index that is wider than the
>> current generation number cannot be compatible (can we even tell the
>> existing clients how wide each elements in the ix array is?)
>>
>> In any case, perhaps the first thing to do is to update the clients
>> so that they stop ignoring the version number field, and instead
>> work without generation number when there is no version of reach.ix
>> available in the file?  That way, a better reachablility index can
>> be chosen freely without having to worry about the compatibility.
>
> I can work on that. It should be as simple as setting commit->generation to
> GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph
> has a different version.

That is a very good idea, but we have here a chicken-and-egg problem.
The only way to denote that the generation numbers / reachability index
have changed (for reading/using: that it changed in backwards
incompatibile way; for update: that it changed at all) is to change the
format version:

  1-byte version number:
  Currently, the only valid version is 1.

But if we assume that different commit-graph format version simply means
that commit->generation is to be set to GENERATION_NUMBER_ZERO, then we
block possible changes to the structure of the commit-graph file.

[Reads further in thread]

Ah, I see that's why you want to introduce [1]:

DS>
DS> 3. Reachability Index versioning

[1]: 
https://public-inbox.org/git/6902dbff-d9f6-e897-2c20-d0cb47a50...@gmail.com/

>>> * **Immutable?** Git objects are _immutable_. If you change an object you
>>>actually create a new object with a new object ID. Are the values we 
>>> store
>>>for these reachability indexes also immutable?
>>
>> Even if we do not embed the reachability ix in commit objects,
>> having an immutable value is probably a must if we want to make them
>> incrementally computable, so this is a very good property to have.
>> Unless there is a clever idea to incrementally compute a mutable
>> reach.ix, my gut instinct says that this property is a must.

I think that the reachability index can be mutable and non-local, but
still being able to be incrementally updated, though perhaps it would
require not only calculations for new commits, but recalculations for
some limited subset of commits existing in the commit-graph file.

I'm trying to modify exact definition of maximal generation numbers to
check if I can find out a version that exhibits nearly-incremental
updates property.

>> Another thing, perhaps related to "local" below, is if exactly the
>> same reach.ix is computed by anybody, given an identical commit
>> history graph (perhaps "reproducibility"?).  I think most of the
>> candidates you listed are reproducible without a fixed tie-breaker,
>> but I am not sure about felineY() thing.

felineY() is deterministic (or can be made deterministic) for given
felineX() (i.e. given [reverse] topological order).

>>> * **Local?** Are these values **locally computable**? That is, do we only
>>>need to look at the parents of a commit (assuming those parents have
>>>computed values) in order to determine the value at that commit?
>>
>> A subset of non-local reachability ix, for example, the ones that
>> need to know what each commit's children are, cannot be immutable,
>> as adding new objects to the graph (either with locally committing,
>> or transferring objects from other repositories) would affect the
>> ix; is this true for all non-local reachability ix, I wonder?
>
> As a thought experiment, we could define a function size(C) to be the
> number of commits reachable from C.

Let's call it reach(C), instead ;-)

>This is not locally-computable
> from the size values at C's parents due to the inclusion-exclusion
> principle. We would need to compute it by walking the reachable set
> and counting (resulting in quadratic performance overall) but is
> immutable. Since the performance cost is so expensive (unlike the
> linear costs in the other non-local versions) I didn't include it
> in my comparison.

Let's define Reach(C) as set of all commits reachable from commit C.
Then reach(C) = ||Reach(C)||, where 

Re: [RFC] Generation Number v2

2018-11-02 Thread Jakub Narebski
Jakub Narebski  writes:
> Stefan Beller  writes:
[...]
>> How would this impact creation of a commit?
>>
>> The current generation numbers can be lazily updated or not
>> updated at all. In my understanding of the maximum generation
>> numbers, a new commit would make these maximum generation
>> numbers invalid (i.e. they have to be recomputed).
[...]
>> For the V2 maximum generation numbers, would we need to
>> rewrite the numbers for all commits once we recompute them?
>> Assuming that is true, it sounds like the benchmark doesn't
>> cover the whole costs associated with V2, which is why the
>> exceptional performance can be explained.
>
> Let's check it using a simple example
>
> First, (minimum) parent-based generation numbers before and after
> extending the commit graph:
>
>   1   2 3   4 5   6   7new
>   1   2 3   4 5   -   -old
>   .---.-.---.-.---*---*
>\
> \   3   4 5   6new
>  \  3   4 5   6old
>   \-.---.-.---.
>  \
>   \   5new
>\  -old
> \-*

Let me check yet another idea, using (minimum) parent-based V0 generation
numbers (counting distance from the sink / root) as a starting number
for source / heads commits.

1   2 3   4 5   6   7new
1   2 3   4 5   -   -old
.---.-.---.-.---*---*
 \
  \   3   4 5   6new
   \  3   4 5   6old
\-.---.-.---.
   \
\   5new
 \  -old
  \-*


Well, on this example it looks like this variant of maximum generation
numbers can be done incrementally, but let's check another example

   1   2 3   4   5   6 7   8   9   new
   1   2 3   4   5   6 7   8   -   old
   .---.-.---.---.---.-.---.---*
\ /
 \   3   4   / 5   6   7   8   new
  \  5   6  /  -   -   -   -   old
   \-.---.-/---*---*---*---*

It looks like it doesn; give as good results as I thought.  Less values
are changed, but you would still need to recalculate them, unless it can
be proven that they do not need it.

>
> Next, maximum generation numbers.  We start with 9 commits, and we end
> up with 12 commits after the change
>
>   6   7 8   9 10  11  12   new
>   4   5 7   8 9   -   -old
>   .---.-.---.-.---*---*
>\
> \   9   1011  12   new
>  \  6   7 8   9old
>   \-.---.-.---.
>  \
>   \   12   new
>\  -old
> \-*
>
>
> As you can see all maximum generation numbers got rewritten.
>
> Though if instead using the number of commits, we use the maximum
> generation number, or in other words the length of the longest path, we
> get the following:
>
>   1   2 3   4 5   6   7new
>   1   2 4   5 6   -   -old
>   .---.-.---.-.---*---*
>\
> \   4   5 6   7new
>  \  3   4 5   6old
>   \-.---.-.---.
>  \
>   \   7new
>\  -old
> \-*
>
> A bit better, but still much change in numbers.

-- 
Jakub Narębski


Re: [RFC] Generation Number v2

2018-11-01 Thread Jakub Narebski
Derrick Stolee  writes:

> Here is a re-formatted version of the tables I introduced earlier.
> The tables were too wide for public-inbox to render correctly (when
> paired with my email client). Hopefully this bulleted-list format
> works better. Thanks, Stefan, for pointing out the rendering
> problems!
>
> ### Test 1: `git log --topo-order -N`
>
> This test focuses on the number of commits that are parsed during
> a `git log --topo-order` before writing `N` commits to output.
>
> You can reproduce this test using `topo-order-tests.sh` and
> see all the data in `topo-order-summary.txt`. The values reported
> here are a sampling of the data, ignoring tests where all values
> were the same or extremely close in value.
>
> android-base, N = 100
>     V0: 5487
>     V1: 8534
>     V2: 6937
>     V3: 6419
>     V4: 6453
[...]

> ### Test 2: `git log --topo-order -10 A..B`
[...]
> android-base 53c1972bc8f 92f18ac3e39
>   OLD: 39403
>    V0:  1544
>    V1:  6957
>    V2:    26
>    V3:  1015
>    V4:  1098

Two minor issues.  First, now that the data is not in the table format,
where every bit of horizontal space matters, why not spell the names of
new proposed "generation numbers" (or rather reachability orderings) in
full, instead of using V0...V4 shortcuts?  It is not as if we were short
of space.

Second, why OLD data is present in tests 2 and 3, but not in test 1?

Best,
-- 
Jakub Narębski


Re: [RFC] Generation Number v2

2018-11-01 Thread Jakub Narebski
Stefan Beller  writes:

>> Based on the performance results alone, we should remove minimum
>> generation numbers, (epoch, date) pairs, and FELINE index from
>> consideration. There are enough examples of these indexes performing
>> poorly.
>>
>> In contrast, maximum generation numbers and corrected commit
>> dates both performed quite well. They are frequently the top
>> two performing indexes, and rarely significantly different.
>>
>> The trade-off here now seems to be: which _property_ is more important,
>> locally-computable or backwards-compatible?
>>
>> * Maximum generation number is backwards-compatible but not
>>locally-computable or immutable.
>
> These maximum generation numbers sound like the reverse of
> the generation numbers as they are currently implemented, i.e.
> we count all commits between the commit A and all heads.

Actually

  maximum generation number =
  number of commits  -  reverse generation number

> How would this impact creation of a commit?
>
> The current generation numbers can be lazily updated or not
> updated at all. In my understanding of the maximum generation
> numbers, a new commit would make these maximum generation
> numbers invalid (i.e. they have to be recomputed).
>
> Are there ways out by deferring the computation of maximum
> generation numbers while still being able to work with some
> commits that are un-accounted for?

As I understand it, new commits added since commit-graph file was
created would get simply INFINITY maximum/maximal generation numbers (if
we were using reverse generation numbers, new commits would get ZERO for
generation number).

> When recomputing these numbers, the current generation number
> (V0) has the property that already existing numbers can be re-used
> as-is. We only need to compute the numbers for new commits,
> and then insert this to the appropriate data structures (which is
> a separate problem, one could imagine a split generation
> numbers file like the split index)

Sidenote: I wonder if there is some on-disk data structure with
similarly fast read, but easier update.

> For the V2 maximum generation numbers, would we need to
> rewrite the numbers for all commits once we recompute them?
> Assuming that is true, it sounds like the benchmark doesn't
> cover the whole costs associated with V2, which is why the
> exceptional performance can be explained.

Let's check it using a simple example

First, (minimum) parent-based generation numbers before and after
extending the commit graph:

  1   2 3   4 5   6   7new
  1   2 3   4 5   -   -old
  .---.-.---.-.---*---*
   \
\   3   4 5   6new
 \  3   4 5   6old
  \-.---.-.---.
 \
  \   5new
   \  -old
\-*


Next, maximum generation numbers.  We start with 9 commits, and we end
up with 12 commits after the change

  6   7 8   9 10  11  12   new
  4   5 7   8 9   -   -old
  .---.-.---.-.---*---*
   \
\   9   1011  12   new
 \  6   7 8   9old
  \-.---.-.---.
 \
  \   12   new
   \  -old
\-*


As you can see all maximum generation numbers got rewritten.

Though if instead using the number of commits, we use the maximum
generation number, or in other words the length of the longest path, we
get the following:

  1   2 3   4 5   6   7new
  1   2 4   5 6   -   -old
  .---.-.---.-.---*---*
   \
\   4   5 6   7new
 \  3   4 5   6old
  \-.---.-.---.
 \
  \   7new
   \  -old
\-*

A bit better, but still much change in numbers.

[...]
>>
>> * Corrected commit-date is locally-computable and immutable,
>>but not backwards-compatible.
>
> How are these dates not backwards incompatible?
> We don't have to expose these dates to the user, but
> - just like generation numbers - could store them and use them
> but not tell the user about it.
>
> We would need to be really careful to not expose them at all
> as they look like the real dates, so that could make for an
> awkward bug report.
>
> The approach of "corrected commit date" sounds like we could
> have a very lazy approach, i.e. no extra data structures needed
> for many commits (as the corrected date equals the real date)
> and only need to store the corrections for some commits.
> Such an approach however would not make it easy to know
> if we operate on corrected dates, or if we even checked them
> for correctness before.
>
> So if we'd have an additional row in the generation numbers file
> telling the corrected date, then we should be able to be backwards
> compatible?

Here, from what I understand, being "backwards 

Re: [RFC] Generation Number v2

2018-11-01 Thread Derrick Stolee

On 11/1/2018 8:27 AM, Jakub Narebski wrote:

[I have noticed that in some places I wrote A..B instead of B..A.  Sorry
  about that]

Derrick Stolee  writes:


Please also let me know about any additional tests that I could
run. Now that I've got a lot of test scripts built up, I can re-run
the test suite pretty quickly.

I would add straighforward 1-to-N and N-to-1 reachability tests in the
form of `git branch / tag --contains` and `git branch / tag --merged`,
and perhaps also ahead-behind calculations used by `git status`, and
N-to-M reachability tests used by tag following code in push and fetch.

Possibly also A...B walk, if it is not implemented via calculating
merge-base.


I believe this uses paint_down_to_common(), but looks at the PARENT1 and
PARENT2 flags afterward to determine the left/right/both relationships.


Maybe even --ancestry-path walk, though I am not sure how important best
performance for rhis case is (we would want good performance, but
perhaps best is not needed for rarely used command).


Currently, the implementation of --ancestry-path does not use a
reachability index.



See explanation below.



Generation Number Performance Test
==

Git uses the commit history to answer many basic questions, such as
computing a merge-base. Some of these algorithms can benefit from a
_reachability index_, which is a condition that allows us to answer
"commit A cannot reach commit B" in some cases. These require pre-
computing some values that we can load and use at run-time. Git
already has a notion of _generation number_, stores it in the commit-
graph file, and uses it in several reachability algorithms.

Note that there are other kinds of reachability indices.

First, there are reachability indices that can answer the full
reachability query (if commit A can reach commit B, or if commit A
cannot reach commit B) directly, without walking the commit graph at
all: so called label-only approach.  For example one could store for
each commit the compressed list of all commits reachable from it
(transitive closure compression).

Those, I think (but I have not checked), would be of not much use for
Git, as the size of the index grows stronger than linear with the number
of commits, as grows the time to compute such index.  So probably of no
use to Git, at least not directly (Git uses so called "bitmap index",
see e.g. [1], which stores reachability bit-vector as compressed
bitmap... but not for all commits, only for a small subset).


Second, beside negative-cut reachability indexes, that can answer with
certainity that "commit A cannot reach commit B", like the generation
numbers (also known as level, or topological level), there also
positive-cut indexes (usually if not always based on the spanning tree
or trees for the DAG), that can answer when "commit A can reach commit
B".

I think that those can lead to significant speedups for at least some
types of commit traversal and reachability queries that Git needs to
answer.  But currently no algorithms has a provision for using
positive-cut filter index.  Anyway, such index would be auxiliary thing,
see the next point.


Third, we require more than having reachability index in the sense of
having some kind of _labelling_, often composite, that can answer either
"commit A cannot reach commit B" or "commit A can reach commit B",
depending on the type.  Git for most operations needs more, namely an
actual _ordering_, that is a weak order (or to be more exact a total
preorder, i.e. there can be two different commits with the same
"generation number" or index, but always either idx(A) ≲ idx(B) or
idx(B) ≲ idx(A)) and not only partial ordering (where there can be
incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)).
This is because it needs to be used in priority queue to decide which
commit to travel next; more on that later.  This is also why there can
be only one such "main" reachability _index_.

[1]: https://githubengineering.com/counting-objects/


Thanks for the details. At the moment, I'm only interested in improving our
negative-cut reachability index, as we have algorithms that can take 
advantage

of them (and can compare their performance directly).




You can read more about generation numbers and how to use them in
algorithms in [this blog
post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/).

However, [generation numbers do not always improve our
algorithms](https://public-inbox.org/git/pull.28.git.gitgitgad...@gmail.com/T/#u).
Specifically, some algorithms in Git already use commit date as a
heuristic reachability index. This has some problems, though, since
commit date can be incorrect for several reasons (clock skew between
machines, purposefully setting GIT_COMMIT_DATE to the author date,
etc.).

That's what we use the "slop" mechanism for: Git would walk a few
commits more than necessary than if there were no clock skew (if 

Re: [RFC] Generation Number v2

2018-11-01 Thread Jakub Narebski
[I have noticed that in some places I wrote A..B instead of B..A.  Sorry
 about that]

Derrick Stolee  writes:

> We've discussed in several places how to improve upon generation
> numbers. This RFC is a report based on my investigation into a
> few new options, and how they compare for Git's purposes on
> several existing open-source repos.
>
> You can find this report and the associated test scripts at
> https://github.com/derrickstolee/gen-test.

Which use prototype test implementation from the `reach-perf` branch in
https://github.com/derrickstolee/git.

> I have some more work to do in order to make this fully
> reproducible (including a single script to clone all of the
> repos I used, compute the commit-graphs using the different
> index versions, and run the tests).
>
> TL;DR: I recommend we pursue one of the following two options:
>
> 1. Maximum Generation Number.
> 2. Corrected Commit Date.
>
> Each of these perform well, but have different implementation
> drawbacks. I'd like to hear what everyone thinks about those
> drawbacks.

I agree with Junio that incremental computation is more important than
backwards compatibility (that the old clients can work with the new
commit-graph without changes).  We would have compatibility breaking
anyway with the planned move from SHA-1 to NewHash aka. SHA-256.

> Please also let me know about any additional tests that I could
> run. Now that I've got a lot of test scripts built up, I can re-run
> the test suite pretty quickly.

I would add straighforward 1-to-N and N-to-1 reachability tests in the
form of `git branch / tag --contains` and `git branch / tag --merged`,
and perhaps also ahead-behind calculations used by `git status`, and
N-to-M reachability tests used by tag following code in push and fetch.

Possibly also A...B walk, if it is not implemented via calculating
merge-base.

Maybe even --ancestry-path walk, though I am not sure how important best
performance for rhis case is (we would want good performance, but
perhaps best is not needed for rarely used command).

See explanation below.


> Generation Number Performance Test
> ==
>
> Git uses the commit history to answer many basic questions, such as
> computing a merge-base. Some of these algorithms can benefit from a
> _reachability index_, which is a condition that allows us to answer
> "commit A cannot reach commit B" in some cases. These require pre-
> computing some values that we can load and use at run-time. Git
> already has a notion of _generation number_, stores it in the commit-
> graph file, and uses it in several reachability algorithms.

Note that there are other kinds of reachability indices.

First, there are reachability indices that can answer the full
reachability query (if commit A can reach commit B, or if commit A
cannot reach commit B) directly, without walking the commit graph at
all: so called label-only approach.  For example one could store for
each commit the compressed list of all commits reachable from it
(transitive closure compression).

Those, I think (but I have not checked), would be of not much use for
Git, as the size of the index grows stronger than linear with the number
of commits, as grows the time to compute such index.  So probably of no
use to Git, at least not directly (Git uses so called "bitmap index",
see e.g. [1], which stores reachability bit-vector as compressed
bitmap... but not for all commits, only for a small subset).


Second, beside negative-cut reachability indexes, that can answer with
certainity that "commit A cannot reach commit B", like the generation
numbers (also known as level, or topological level), there also
positive-cut indexes (usually if not always based on the spanning tree
or trees for the DAG), that can answer when "commit A can reach commit
B".

I think that those can lead to significant speedups for at least some
types of commit traversal and reachability queries that Git needs to
answer.  But currently no algorithms has a provision for using
positive-cut filter index.  Anyway, such index would be auxiliary thing,
see the next point.


Third, we require more than having reachability index in the sense of
having some kind of _labelling_, often composite, that can answer either
"commit A cannot reach commit B" or "commit A can reach commit B",
depending on the type.  Git for most operations needs more, namely an
actual _ordering_, that is a weak order (or to be more exact a total
preorder, i.e. there can be two different commits with the same
"generation number" or index, but always either idx(A) ≲ idx(B) or
idx(B) ≲ idx(A)) and not only partial ordering (where there can be
incomparable elements, i.e neither idx(A) ≼ idx(B) nor idx(B) ≼ idx(A)).
This is because it needs to be used in priority queue to decide which
commit to travel next; more on that later.  This is also why there can
be only one such "main" reachability _index_.

[1]: https://githubengineering.com/counting-objects/

> You can 

Re: [RFC] Generation Number v2

2018-10-31 Thread Derrick Stolee

On 10/31/2018 8:54 AM, Ævar Arnfjörð Bjarmason wrote:

On Tue, Oct 30 2018, Junio C Hamano wrote:


Derrick Stolee  writes:

In contrast, maximum generation numbers and corrected commit
dates both performed quite well. They are frequently the top
two performing indexes, and rarely significantly different.

The trade-off here now seems to be: which _property_ is more important,
locally-computable or backwards-compatible?

Nice summary.

As I already said, I personally do not think being compatible with
currently deployed clients is important at all (primarily because I
still consider the whole thing experimental), and there is a clear
way forward once we correct the mistake of not having a version
number in the file format that tells the updated clients to ignore
the generation numbers.  For longer term viability, we should pick
something that is immutable, reproducible, computable with minimum
input---all of which would lead to being incrementally computable, I
would think.

I think it depends on what we mean by backwards compatibility. None of
our docs are saying this is experimental right now, just that it's
opt-in like so many other git-config(1) options.

So if we mean breaking backwards compatibility in that we'll write a new
file or clobber the existing one with a version older clients can't use
as an optimization, fine.

But it would be bad to produce a hard error on older clients, but
avoiding that seems as easy as just creating a "commit-graph2" file in
.git/objects/info/.


Well, we have a 1-byte version number following the "CGPH" header in
the commit-graph file, and clients will ignore the commit-graph file
if that number is not "1". My hope for backwards-compatibility was
to avoid incrementing this value and instead use the unused 8th byte.

However, it appears that we are destined to increment that version
number, anyway. Here is my list for what needs to be in the next
version of the commit-graph file format:

1. A four-byte hash version.

2. File incrementality (split commit-graph).

3. Reachability Index versioning

Most of these changes will happen in the file header. The chunks
themselves don't need to change, but some chunks may be added that
only make sense in v2 commit-graphs.

Thanks,
-Stolee


Re: [RFC] Generation Number v2

2018-10-31 Thread Ævar Arnfjörð Bjarmason


On Tue, Oct 30 2018, Junio C Hamano wrote:

> Derrick Stolee  writes:
>> In contrast, maximum generation numbers and corrected commit
>> dates both performed quite well. They are frequently the top
>> two performing indexes, and rarely significantly different.
>>
>> The trade-off here now seems to be: which _property_ is more important,
>> locally-computable or backwards-compatible?
>
> Nice summary.
>
> As I already said, I personally do not think being compatible with
> currently deployed clients is important at all (primarily because I
> still consider the whole thing experimental), and there is a clear
> way forward once we correct the mistake of not having a version
> number in the file format that tells the updated clients to ignore
> the generation numbers.  For longer term viability, we should pick
> something that is immutable, reproducible, computable with minimum
> input---all of which would lead to being incrementally computable, I
> would think.

I think it depends on what we mean by backwards compatibility. None of
our docs are saying this is experimental right now, just that it's
opt-in like so many other git-config(1) options.

So if we mean breaking backwards compatibility in that we'll write a new
file or clobber the existing one with a version older clients can't use
as an optimization, fine.

But it would be bad to produce a hard error on older clients, but
avoiding that seems as easy as just creating a "commit-graph2" file in
.git/objects/info/.


Re: [RFC] Generation Number v2

2018-10-31 Thread Derrick Stolee

On 10/29/2018 11:59 PM, Junio C Hamano wrote:

Derrick Stolee  writes:


**V3: Corrected Commit Date.**
For a commit C, let its _corrected commit date_ (denoted by cdate(C))
be the maximum of the commit date of C and the commit dates of its
parents.

"maximum of the commit date of C and the corrected commit dates of
its parents"?


That's what I mean. Thanks.



We've talked about exactly this one in the past (long before any of
Microsoft folks other than Dscho came to the scene) but without an
implementation, or detailed design and analysis.  I am very happy to
see the idea among other possibilities to be considered again.  This
time around, we may finally come up with something better than the
"commit dates with SLOP" thing ;-).


Essentially, the felineY order is selected with the goal of swapping
positions of topologically-independent commits relative to the felinX
ordering. The resulting reachability index is as follows:

If felineX(A) < felineY(B), then A cannot reach B.
If felineY(A) < felineY(B), then A cannot reach B.

I presume that the first line is a typo and you compare the same X index?


Yes, sorry for the typos. I fixed them in the report on GitHub.




* **Compatible?** In our test implementation, we use a previously unused
   byte of data in the commit-graph format to indicate which reachability
   index version we are using. Existing clients ignore this value, so we
   will want to consider if these new indexes are _backwards compatible_.
   That is, will they still report correct values if they ignore this byte
   and use the generation number column from the commit-graph file assuming
   the values are minimum generation numbers?

I personally consider that the current commit-graph with generation
numbers experimental, so I am not sure how much we care about this.

Having said that.

By the above definition, any new index that is wider than the
current generation number cannot be compatible (can we even tell the
existing clients how wide each elements in the ix array is?)

In any case, perhaps the first thing to do is to update the clients
so that they stop ignoring the version number field, and instead
work without generation number when there is no version of reach.ix
available in the file?  That way, a better reachablility index can
be chosen freely without having to worry about the compatibility.


I can work on that. It should be as simple as setting commit->generation to
GENERATION_NUMBER_ZERO in fill_commit_in_graph when the graph
has a different version.




* **Immutable?** Git objects are _immutable_. If you change an object you
   actually create a new object with a new object ID. Are the values we
store
   for these reachability indexes also immutable?

Even if we do not embed the reachability ix in commit objects,
having an immutable value is probably a must if we want to make them
incrementally computable, so this is a very good property to have.
Unless there is a clever idea to incrementally compute a mutable
reach.ix, my gut instinct says that this property is a must.

Another thing, perhaps related to "local" below, is if exactly the
same reach.ix is computed by anybody, given an identical commit
history graph (perhaps "reproducibility"?).  I think most of the
candidates you listed are reproducible without a fixed tie-breaker,
but I am not sure about felineY() thing.


* **Local?** Are these values **locally computable**? That is, do we only
   need to look at the parents of a commit (assuming those parents have
   computed values) in order to determine the value at that commit?

A subset of non-local reachability ix, for example, the ones that
need to know what each commit's children are, cannot be immutable,
as adding new objects to the graph (either with locally committing,
or transferring objects from other repositories) would affect the
ix; is this true for all non-local reachability ix, I wonder?


As a thought experiment, we could define a function size(C) to be the
numberof commits reachable from C. This is not locally-computable
from the size values at C's parents due to the inclusion-exclusion
principle. We would need to compute it by walking the reachable set
and counting (resulting in quadratic performance overall) but is
immutable. Since the performance cost is so expensive (unlike the
linear costs in the other non-local versions) I didn't include it
in my comparison.




We focused on three types of performance tests that test the indexes
in different ways. Each test lists the `git` command that is used,
and the table lists which repository is used and which inputs.

### Test 1: `git log --topo-order -N`

This test focuses on the number of commits that are parsed during
a `git log --topo-order` before writing `N` commits to output.

A devil's advocate comment.  Your patches seem to be very focused on
this "unlimited" case for the past few weeks, but I am not so sure
if that is a case worth optimizing for.  If "git log --topo-order -N
HEAD~M.." 

Re: [RFC] Generation Number v2

2018-10-29 Thread Junio C Hamano
Derrick Stolee  writes:

> **V3: Corrected Commit Date.**
> For a commit C, let its _corrected commit date_ (denoted by cdate(C))
> be the maximum of the commit date of C and the commit dates of its
> parents.

"maximum of the commit date of C and the corrected commit dates of
its parents"?

We've talked about exactly this one in the past (long before any of
Microsoft folks other than Dscho came to the scene) but without an
implementation, or detailed design and analysis.  I am very happy to
see the idea among other possibilities to be considered again.  This
time around, we may finally come up with something better than the
"commit dates with SLOP" thing ;-).

> Essentially, the felineY order is selected with the goal of swapping
> positions of topologically-independent commits relative to the felinX
> ordering. The resulting reachability index is as follows:
>
>If felineX(A) < felineY(B), then A cannot reach B.
>If felineY(A) < felineY(B), then A cannot reach B.

I presume that the first line is a typo and you compare the same X index?

> * **Compatible?** In our test implementation, we use a previously unused
>   byte of data in the commit-graph format to indicate which reachability
>   index version we are using. Existing clients ignore this value, so we
>   will want to consider if these new indexes are _backwards compatible_.
>   That is, will they still report correct values if they ignore this byte
>   and use the generation number column from the commit-graph file assuming
>   the values are minimum generation numbers?

I personally consider that the current commit-graph with generation
numbers experimental, so I am not sure how much we care about this.

Having said that.

By the above definition, any new index that is wider than the
current generation number cannot be compatible (can we even tell the
existing clients how wide each elements in the ix array is?)

In any case, perhaps the first thing to do is to update the clients
so that they stop ignoring the version number field, and instead
work without generation number when there is no version of reach.ix
available in the file?  That way, a better reachablility index can
be chosen freely without having to worry about the compatibility.

> * **Immutable?** Git objects are _immutable_. If you change an object you
>   actually create a new object with a new object ID. Are the values we
> store
>   for these reachability indexes also immutable?

Even if we do not embed the reachability ix in commit objects,
having an immutable value is probably a must if we want to make them
incrementally computable, so this is a very good property to have.
Unless there is a clever idea to incrementally compute a mutable
reach.ix, my gut instinct says that this property is a must.

Another thing, perhaps related to "local" below, is if exactly the
same reach.ix is computed by anybody, given an identical commit
history graph (perhaps "reproducibility"?).  I think most of the
candidates you listed are reproducible without a fixed tie-breaker,
but I am not sure about felineY() thing.

> * **Local?** Are these values **locally computable**? That is, do we only
>   need to look at the parents of a commit (assuming those parents have
>   computed values) in order to determine the value at that commit?

A subset of non-local reachability ix, for example, the ones that
need to know what each commit's children are, cannot be immutable,
as adding new objects to the graph (either with locally committing,
or transferring objects from other repositories) would affect the
ix; is this true for all non-local reachability ix, I wonder?

> We focused on three types of performance tests that test the indexes
> in different ways. Each test lists the `git` command that is used,
> and the table lists which repository is used and which inputs.
>
> ### Test 1: `git log --topo-order -N`
>
> This test focuses on the number of commits that are parsed during
> a `git log --topo-order` before writing `N` commits to output.

A devil's advocate comment.  Your patches seem to be very focused on
this "unlimited" case for the past few weeks, but I am not so sure
if that is a case worth optimizing for.  If "git log --topo-order -N
HEAD~M.." (for some number M) gives us a similar result as unlimited
case but with much less effort, wouldn't it be good enough that lets
us concentrate on other use cases instead?

> Based on the performance results alone, we should remove minimum
> generation numbers, (epoch, date) pairs, and FELINE index from
> consideration. There are enough examples of these indexes performing
> poorly.

OK.

> In contrast, maximum generation numbers and corrected commit
> dates both performed quite well. They are frequently the top
> two performing indexes, and rarely significantly different.
>
> The trade-off here now seems to be: which _property_ is more important,
> locally-computable or backwards-compatible?

Nice summary.  

As I already said, I personally do not think 

Re: [RFC] Generation Number v2

2018-10-29 Thread Derrick Stolee

Here is a re-formatted version of the tables I introduced earlier.
The tables were too wide for public-inbox to render correctly (when
paired with my email client). Hopefully this bulleted-list format
works better. Thanks, Stefan, for pointing out the rendering
problems!

### Test 1: `git log --topo-order -N`

This test focuses on the number of commits that are parsed during
a `git log --topo-order` before writing `N` commits to output.

You can reproduce this test using `topo-order-tests.sh` and
see all the data in `topo-order-summary.txt`. The values reported
here are a sampling of the data, ignoring tests where all values
were the same or extremely close in value.

android-base, N = 100
    V0: 5487
    V1: 8534
    V2: 6937
    V3: 6419
    V4: 6453
android-base, N = 1000
    V0: 36029
    V1: 44030
    V2: 41493
    V3: 41206
    V4: 45431
chromium, N = 100
    V0:    101
    V1: 424406
    V2:    101
    V3:    101
    V4:    101
gerrit, N = 100
    V0: 8212
    V1: 8533
    V2:  164
    V3:  159
    V4:  162
gerrit, N = 1000
    V0: 8512
    V1: 8533
    V2: 1990
    V3: 1973
    V4: 3766
Linux, N = 100
    V0: 12458
    V1: 12444
    V2: 13683
    V3: 13123
    V4: 13124
Linux, N = 1000
    V0: 24436
    V1: 26247
    V2: 27878
    V3: 26430
    V4: 27875
Linux, N = 1
    V0: 30364
    V1: 28891
    V2: 27878
    V3: 26430
    V4: 27875
electron, N = 1000
    V0: 19927
    V1: 18733
    V2:  1072
    V3: 18214
    V4: 18214
Ffmpeg, N = 1
    V0: 32154
    V1: 47429
    V2: 10435
    V3: 11054
    V4: 11054
jgit, N = 1000
    V0: 1550
    V1: 6264
    V2: 1067
    V3: 1060
    V4: 1233
julia, N = 1
    V0: 43043
    V1: 43043
    V2: 10201
    V3: 23505
    V4: 23828
odoo, N = 1000
    V0: 17175
    V1:  9714
    V2:  4043
    V3:  4046
    V4:  4111
php-src, N = 1000
    V0: 19014
    V1: 27530
    V2:  1311
    V3:  1305
    V4:  1320
rails, N = 100
    V0: 1420
    V1: 2041
    V2: 1757
    V3: 1428
    V4: 1441
rails, N = 1000
    V0:  7952
    V1: 10145
    V2: 10053
    V3:  8373
    V4:  8373
swift, N = 1000
    V0: 1914
    V1: 4004
    V2: 2071
    V3: 1939
    V4: 1940
tensorflow, N = 1000
    V0: 10019
    V1: 39221
    V2:  6711
    V3: 10051
    V4: 10051
TypeScript, N = 1000
    V0:  2873
    V1: 12014
    V2:  3049
    V3:  2876
    V4:  2876


### Test 2: `git log --topo-order -10 A..B`

This test focuses on the number of commits that are parsed during
a `git log --topo-order A..B` before writing ten commits to output.
Since we fix a very small set of output commits, we care more about
the part of the walk that determines which commits are reachable
from `B` but not reachable from `A`. This part of the walk uses
commit date as a heuristic in the existing implementation.

You can reproduce this test using `topo-compare-tests.sh` and
see all the data in `topo-compare-summary.txt`. The values reported
here are a sampling of the data, ignoring tests where all values
were the same or extremely close in value.

_Note:_ For some of the rows, the `A` and `B` values may be
swapped from what is expected. This is due to (1) a bug in the
reference implementation that doesn't short-circuit the walk
when `A` can reach `B`, and (2) data-entry errors by the author.
The bug can be fixed, but would have introduced strange-looking
rows in this table.

android-base 53c1972bc8f 92f18ac3e39
  OLD: 39403
   V0:  1544
   V1:  6957
   V2:    26
   V3:  1015
   V4:  1098
gerrit c4311f7642 777a8cd1e0
  OLD:  6457
   V0:  7836
   V1: 10869
   V2:   415
   V3:   414
   V4:   445
electron 7da7dd85e addf069f2
  OLD: 18164
   V0:   945
   V1:  6528
   V2:    17
   V3:    17
   V4:    18
julia 7faee1b201 e2022b9f0f
  OLD: 22800
   V0:  4221
   V1: 12710
   V2:   377
   V3:   213
   V4:   213
julia ae69259cd9 c8b5402afc
  OLD:  1864
   V0:  1859
   V1: 13287
   V2:    12
   V3:  1859
   V4:  1859
Linux 69973b830859 c470abd4fde4
  OLD: 111692
   V0:  77263
   V1:  96598
   V2:  80238
   V3:  76332
   V4:  76495
Linux c3b92c878736 19f949f52599
  OLD: 167418
   V0:   5736
   V1:   4684
   V2:   9675
   V3:   3887
   V4:   3923
Linux c8d2bc9bc39e 69973b830859
  OLD: 44940
   V0:  4056
   V1: 16636
   V2: 10405
   V3:  3475
   V4:  4022
odoo 4a31f55d0a0 93fb2b4a616
  OLD: 25139
   V0: 19528
   V1: 20418
   V2: 19874
   V3: 19634
   V4: 27247
swift 4046359efd b34b6a14c7
  OLD: 13411
   V0:   662
   V1:   321
   V2:    12
   V3:    80
   V4:   134
tensorflow ec6d17219c fa1db5eb0d
  OLD: 10373
   V0:  4762
   V1: 36272
   V2:   174
   V3:  3631
   V4:  3632
TypeScript 35ea2bea76 123edced90
  OLD:  3450
   V0:   267
   V1: 10386
   V2:    27
   V3:   259
   V4:   259


### Test 3: `git merge-base A B`

This test focuses on the number of commits that are parsed during
a `git merge-base A B`. This part of the walk uses commit date as
a heuristic in the existing implementation.

You can reproduce this test using `merge-base-tests.sh` and
see all the data in `merge-base-summary.txt`. The values reported
here are a 

Re: [RFC] Generation Number v2

2018-10-29 Thread Derrick Stolee

On 10/29/2018 3:22 PM, Stefan Beller wrote:

Based on the performance results alone, we should remove minimum
generation numbers, (epoch, date) pairs, and FELINE index from
consideration. There are enough examples of these indexes performing
poorly.

In contrast, maximum generation numbers and corrected commit
dates both performed quite well. They are frequently the top
two performing indexes, and rarely significantly different.

The trade-off here now seems to be: which _property_ is more important,
locally-computable or backwards-compatible?

* Maximum generation number is backwards-compatible but not
locally-computable or immutable.

These maximum generation numbers sound like the reverse of
the generation numbers as they are currently implemented, i.e.
we count all commits between the commit A and all heads.
How would this impact creation of a commit?


The indexes listed here would be computed at the same time as a 
commit-graph (and only change on commit-graph writes).


This idea of having something depend on the "global state" isn't ideal 
(in my opinion) but it certainly works for the tests specified below.



The current generation numbers can be lazily updated or not
updated at all. In my understanding of the maximum generation
numbers, a new commit would make these maximum generation
numbers invalid (i.e. they have to be recomputed).


New commits would change the value you would get by the definition (on 
rewrite of the commit-graph) but don't violate the reachability index 
property (maxgen(A) < maxgen(B) =>

A can't reach B). So that doesn't effect their usefulness.


Are there ways out by deferring the computation of maximum
generation numbers while still being able to work with some
commits that are un-accounted for?


Creating a commit that doesn't exist in the commit-graph file results in 
a commit with generation number INFINITY. The new commits can't be 
reached by commits with finite generation number.



When recomputing these numbers, the current generation number
(V0) has the property that already existing numbers can be re-used
as-is. We only need to compute the numbers for new commits,
and then insert this to the appropriate data structures (which is
a separate problem, one could imagine a split generation
numbers file like the split index)

For the V2 maximum generation numbers, would we need to
rewrite the numbers for all commits once we recompute them?
Assuming that is true, it sounds like the benchmark doesn't
cover the whole costs associated with V2, which is why the
exceptional performance can be explained.
You're right, we need to recompute the entire list of maximum generation 
number values. However, you are a bit hung up on "maxgen" being a fixed 
function that is correct at all times. We could compute the "maxgen" for 
the set of commits that are written to the "split" commit-graph while 
leaving the base the same. There is one tricky part here: we need to 
initialize the values starting at "num commits in the combined 
commit-graph" (commits in the base and split portion). The minimum 
possible value in the split portion will be larger than the maximum 
possible value in the base portion. Since the (future) implementation of 
split commit-graphs would have all commit-parent edges pointing down 
into the base and not from the base into the split, this will continue 
to satisfy our reachability index property.


As for the cost: we need to do a topo-order walk and then a quick 
minimum-value check across all parents. This is O(N), and the constant 
is small. This may be a cost worth paying.



(Note: The table as read in
https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51ef...@gmail.com/
seems line broken, using gmails web interface is not good
for ASCII art and patches, git-send-email would fare better)


Sorry for that. I copy-pasted into Thunderbird. Hopefully most users can 
redirect to the GitHub-rendered markdown if they have trouble.





* Corrected commit-date is locally-computable and immutable,
but not backwards-compatible.

How are these dates not backwards incompatible?
We don't have to expose these dates to the user, but
- just like generation numbers - could store them and use them
but not tell the user about it.

We would need to be really careful to not expose them at all
as they look like the real dates, so that could make for an
awkward bug report.


By "backwards-compatible" I mean that we could store them in the 
commit-graph file without breaking existing clients (or by introducing 
extra data).


We could easily add these corrected commit dates as an optional chunk, 
but that adds 8 bytes per commit, and we already use 8 bytes for the 
(generation, date) pairs.


The solution I made in my prototype is to replace the generation number 
with an offset for how much to add to the commit-date to get the 
corrected commit-date. However, an old client would interpret these 
offsets as generations and have incorrect output.


A 

Re: [RFC] Generation Number v2

2018-10-29 Thread Stefan Beller
> Based on the performance results alone, we should remove minimum
> generation numbers, (epoch, date) pairs, and FELINE index from
> consideration. There are enough examples of these indexes performing
> poorly.
>
> In contrast, maximum generation numbers and corrected commit
> dates both performed quite well. They are frequently the top
> two performing indexes, and rarely significantly different.
>
> The trade-off here now seems to be: which _property_ is more important,
> locally-computable or backwards-compatible?
>
> * Maximum generation number is backwards-compatible but not
>locally-computable or immutable.

These maximum generation numbers sound like the reverse of
the generation numbers as they are currently implemented, i.e.
we count all commits between the commit A and all heads.
How would this impact creation of a commit?

The current generation numbers can be lazily updated or not
updated at all. In my understanding of the maximum generation
numbers, a new commit would make these maximum generation
numbers invalid (i.e. they have to be recomputed).

Are there ways out by deferring the computation of maximum
generation numbers while still being able to work with some
commits that are un-accounted for?

When recomputing these numbers, the current generation number
(V0) has the property that already existing numbers can be re-used
as-is. We only need to compute the numbers for new commits,
and then insert this to the appropriate data structures (which is
a separate problem, one could imagine a split generation
numbers file like the split index)

For the V2 maximum generation numbers, would we need to
rewrite the numbers for all commits once we recompute them?
Assuming that is true, it sounds like the benchmark doesn't
cover the whole costs associated with V2, which is why the
exceptional performance can be explained.

(Note: The table as read in
https://public-inbox.org/git/6367e30a-1b3a-4fe9-611b-d931f51ef...@gmail.com/
seems line broken, using gmails web interface is not good
for ASCII art and patches, git-send-email would fare better)

>
> * Corrected commit-date is locally-computable and immutable,
>but not backwards-compatible.

How are these dates not backwards incompatible?
We don't have to expose these dates to the user, but
- just like generation numbers - could store them and use them
but not tell the user about it.

We would need to be really careful to not expose them at all
as they look like the real dates, so that could make for an
awkward bug report.

The approach of "corrected commit date" sounds like we could
have a very lazy approach, i.e. no extra data structures needed
for many commits (as the corrected date equals the real date)
and only need to store the corrections for some commits.
Such an approach however would not make it easy to know
if we operate on corrected dates, or if we even checked them
for correctness before.

So if we'd have an additional row in the generation numbers file
telling the corrected date, then we should be able to be backwards
compatible?


[RFC] Generation Number v2

2018-10-29 Thread Derrick Stolee

We've discussed in several places how to improve upon generation
numbers. This RFC is a report based on my investigation into a
few new options, and how they compare for Git's purposes on
several existing open-source repos.

You can find this report and the associated test scripts at
https://github.com/derrickstolee/gen-test.

I have some more work to do in order to make this fully
reproducible (including a single script to clone all of the
repos I used, compute the commit-graphs using the different
index versions, and run the tests).

TL;DR: I recommend we pursue one of the following two options:

1. Maximum Generation Number.
2. Corrected Commit Date.

Each of these perform well, but have different implementation
drawbacks. I'd like to hear what everyone thinks about those
drawbacks.

Please also let me know about any additional tests that I could
run. Now that I've got a lot of test scripts built up, I can re-run
the test suite pretty quickly.

Thanks,
-Stolee



Generation Number Performance Test
==

Git uses the commit history to answer many basic questions, such as
computing a merge-base. Some of these algorithms can benefit from a
_reachability index_, which is a condition that allows us to answer
"commit A cannot reach commit B" in some cases. These require pre-
computing some values that we can load and use at run-time. Git
already has a notion of _generation number_, stores it in the commit-
graph file, and uses it in several reachability algorithms.

You can read more about generation numbers and how to use them in
algorithms in [this blog 
post](https://blogs.msdn.microsoft.com/devops/2018/07/09/supercharging-the-git-commit-graph-iii-generations/).


However, [generation numbers do not always improve our 
algorithms](https://public-inbox.org/git/pull.28.git.gitgitgad...@gmail.com/T/#u).

Specifically, some algorithms in Git already use commit date as a
heuristic reachability index. This has some problems, though, since
commit date can be incorrect for several reasons (clock skew between
machines, purposefully setting GIT_COMMIT_DATE to the author date, etc.).
However, the speed boost by using commit date as a cutoff was so
important in these cases, that the potential for incorrect answers was
considered acceptable.

When these algorithms were converted to use generation numbers, we
_added_ the extra constraint that the algorithms are _never incorrect_.
Unfortunately, this led to some cases where performance was worse than
before. There are known cases where `git merge-base A B` or
`git log --topo-order A..B` are worse when using generation numbers
than when using commit dates.

This report investigates four replacements for generation numbers, and
compares the number of walked commits to the existing algorithms (both
using generation numbers and not using them at all). We can use this
data to make decisions for the future of the feature.

### Implementation

The reachability indexes below are implemented in
[the `reach-perf` branch in 
https://github.com/derrickstolee/git](https://github.com/derrickstolee/git/tree/reach-perf).

This implementation is in a very rough condition, as it is intended to
be a prototype, not production-quality.

Using this implementation, you can compute commit-graph files for the
specified reachability index using `git commit-graph write --reachable 
--version=`.

The `git` client will read the version from the file, so our tests
store each version as `.git/objects/info/commit-graph.` and copy
the necessary file to `.git/objects/info/commit-graph` before testing.

The algorithms count the number of commits walked, as to avoid the
noise that would occur with time-based performance reporting. We use
the (in progress) trace2 library for this. To find the values reported,
use the `GIT_TR2_PERFORMANCE` environment variable.

To ignore reachability indexes entirely and use the old algorithms
(reported here as "OLD" values) use the environment variable
`GIT_TEST_OLD_PAINT=1`.


Reachability Index Versions
---

**V0: (Minimum) Generation Number.**
The _generation number_ of a commit is exactly one more than the maximum
generation number of a parent (by convention, the generation number of a
commit with no parents is 1). This is the definition currently used by
Git (2.19.0 and later). Given two commits A and B, we can then use the
following reachability index:

    If gen(A) < gen(B), then A cannot reach B.

_Commentary:_ One issue with generation numbers is that some algorithms
in Git use commit date as a heuristic, and allow incorrect answers if
there is enough clock skew. Using that heuristic, the algorithms can walk
fewer commits than the algorithms using generation number. The other
reachability indexes introduced below attempt to fix this problem.

**V1: (Epoch, Date) Pairs.**
For each commit, store a pair of values called the _epoch_ and the _date_.
The date is the normal commit date. The _epoch_ of a commit is