xxx

On 5/1/21 12:45 AM, Joerg Sonnenberger wrote:
Hello all,
this is a bit of a brain dump based on a recent discussion with
Pierre-Yves on IRC. I think it deserves some input from others and maybe
it can grow into a proper plan in time.

Thanks for writing this down!

For the context of this discussion, see
   https://phab.mercurial-scm.org/D10082
and
   https://bz.mercurial-scm.org/show_bug.cgi?id=6219
In short, the changelog entry itself is supposed to keep a list of files
touched by the change, but there are two major shortcomings:
(1) It doesn't show removals.
(2) Older versions notoriously mishandled merges.

This is painful not just for the review mentioned above, but various
other places that want to efficiently track what files could have been
affected by a changeset.

Note that as part of my new copy tracing work I actually added the ability to compute, record and access such information.

https://www.mercurial-scm.org/repo/hg/file/5.8/mercurial/metadata.py#l31

This should provide a, reliable, out-of-hash, equivalent to the current `files` field. With extra informations.

This is made even worse by the inability to do
get fast manifest deltas. The current manifest code again has two major
shortcomings:
(1) It only works against the delta parent and that isn't necessarily
one of the changeset parents.
(2) It doesn't show removals.

Now I have been discussing "Bonsai" changesets as option for a while,
but that would certainly introduce a major breakage and is therefore
something that must be opt-in. But the idea by itself has enough merits
and brings the question on whether we can't compute the equivalent as
(non-optional) cache. What do we want in such a cache? We want to allow
walking from parent to child or the other way around and efficiently
compute the new file nodes. For merge commits, it should be possible to
traverse to either parent efficently. The first idea would be to store
lists of (path, old node, new node) for non-merges and (path, p1 node,
p2 node, new node) for merge commits. Added/removed files are using a
special marker node like <null> like in other places. This would be
require variable size entries (more parsing) and quite a bit of text
storage for big/deep repos.

We also have to keep tree manifest in mind. I would like us to build tree-manifest-compatible code, hash, and structure in the future to make sure the feature is available to people we starts to needs it.

Keeping list of changes from both parent can come with various problems as many "unmerged" (ie: all the simply updated) files have to be included leading to a large amount of duplicated information for every merge creating associated size issue. (However your final proposal might work around that)

That's undesirable, so the first improvement
would be to have a side table of the path names similar to how we handle
it in the rev-branch-cache.

There is some overlap with other things we do here:
* the fncache as somewhat close to a list of all the files we ever saw,
* the dirstate as a list of all the files currently relevant for the revision (and possibly other files with coming new format),
* the new sidedata fields also store filename of affected files.

So having some kind of universal way to store, retrieve and represent filename could make sense.

The second option would be to store file
revisions and not nodes, but that has some concerns about strip handling.
It would reduce the size of the entries by factor of 4 to 7 though.

Are you looking for a performance benefit, a size benefit (or both) or something else here ? If something else what are we talking about ?

Using
such a data structure for traversing is easy, just figure out which
parent is the desired parent and pick the appropriate old/new entries.

(again, keeping data for p1 and p2 may result in size explosion, so we have to be careful)

Computing this structure is bit tricky as we do not want to ever compute
the full text of parent or child. So we need to process the diffs
directly.

The computation in `mercurial/metadata.py` should have everything we need (or be quite close to that).

For that, we need to know the starting position of each file.
The end of each line we compute easily based on the node len and the
path name. We can exploit the special nature of the manifest: all lines
are sorted and path is unique. Based on that we can build an augmented
binary tree that keeps track of the size of its subtrees. The tree can
be shared with the delta base and updated bottom-up, e.g. as
red-black-tree that would require O(d log n) new nodes to be written
where d is the number of lines touched and n the total number of lines
in the manifest. For merges or if the delta is not against the parent of
a changeset, a bit more work is necessary to compute the precise diff.
This can be done either by walking the path between the nodes or by
recursively comparing the trees. I would expect that in most cases the
recursive compare works quite well as most tree nodes should end up
being shared for the typical short term forks and the like.

A tree approach looks quite interesting, since it solve the "storing the diff to p1 and the diff p2 leading to duplicating a pathological amount of data" problem.


I am not certain of what you suggests, but at that rate, it looks like we could "simply" have some kind of tree based storage for a full manifest and benefit from a natural and fast diff between manifest, ie: without having to restore the full manifest.

In otherword: maybe well built tree manifests focused on efficient diff and reuse of information could solve this. (We could even move to be the official manifest scheme, using some recursive/commutative hashing)

However that would not fully superseed what we currently store in "ChangedFiles" to perform changeset centric copy tracing. So we would need a bit more than just that. I am not worried however.

Cheers,

--
Pierre-Yves David
_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel

Reply via email to