Excerpts from Pierre-Yves David's message of 2016-11-10 17:22:32 +0000:
> 
> On 11/10/2016 04:34 PM, Jun Wu wrote:
> > Excerpts from Pierre-Yves David's message of 2016-11-10 15:02:26 +0000:
> >> The plan page does not need to be very complex.
> >>
> >> So your current plan is to have this "(path, fnode) → [linkrev]" mapping
> >> stored in a sqlite database and updated on the fly when needed. What
> >> was your rationale for picking this over a more complete cache of
> >> (changerev, path, filenode) exchanged over the wire?
> >>
> >> Out of curiosity, do you have speedup numbers for your current experiment?
> >
> > The experiment is still ongoing, so I don't have data or clear plans yet.
> > Things can change - for example, I may want to replace sqlite with gdbm.
> > (So I'm still not motivated to write a plan page that is subject to change.)
> >
> >
> > On "on demand":
> >
> > The motivation for the linkrev cache is to speed up building the
> > fastannotate cache, which essentially goes through every file in the repo
> > and does massive adjustlinkrev work. So it's better to have an offline
> > command like "debugbuildlinkrevcache" to build the cache in a more
> > efficient way. I'm less concerned about the on-demand use-case - ideally
> > I want to make sure the linkrev database is *always* up-to-date, i.e. it
> > contains all possible candidate linkrevs for every file node, so the
> > changelog.d walk becomes unnecessary and we can add another optimization:
> > if there is only one candidate linkrev, just use it and skip the
> > changelog.i ancestor test.
> >
> > That said, "on demand" could be an optional feature; it's not hard to
> > implement, after all.
> 
> Now I'm confused. I assumed your database was on-demand, but it is actually
> command-triggered, so how is your cache updated as the repo moves forward?

It could be:

  1. as a first naive step, put "debugbuildlinkrevcache" in a crontab
  2. hook things like "commit" / "pull" so we can maintain the invariant
     that the database is always up-to-date (see the sketch below)
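
For option 2, a minimal sketch of what the hgrc wiring could look like
(the hook points below are real Mercurial hooks, but using the debug
command as the hook body is only illustrative; a real integration would
probably be an in-process hook for speed):

  [hooks]
  # rebuild the linkrev cache after each local commit
  commit.linkrevcache = hg debugbuildlinkrevcache
  # ... and after incoming changesets (pull / push / unbundle)
  changegroup.linkrevcache = hg debugbuildlinkrevcache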

> (I admit I did not read your paste in full depth and did not see
> details about that in the initial doc)
> 
> > On "(changerev, path, filenode)":
> >
> > Note that "filenode" is redundant because "(changerev, path)" determines
> > the filenode.
> 
> Yes, but only via a manifest lookup, so having it in the cache is more cachy ;-)
> 
> > I believe a traditional database for "(changerev, path) -> linkrev" is not
> > the way to go, as it could be extremely inefficient in space usage when
> > built in full by an offline command (think of storing manifests un-delta-ed).
> 
> The "path" part does not necessarly have to be in the database, if we 
> have a per-filelog-cache alongside the filelog, that would not be taking 
> extra space.

I'm not sure what you plan to do with multiple changelog revisions. Let's
look at an example:

  changelog rev: a --- b --- (10000 changesets) --- d --- e
  filelog rev:   1 --- 1 --- (unchanged)        --- 1 --- 2

So if we store all changerevs, the cache is 10000x larger (than the version
without changerevs) in this case, which is unacceptable.

If we don't store all of them, but just store "a" and "e" (let's name this
approach P; it will be mentioned below), we lose the ability to look up
b::d quickly - a query for any changeset between b and d finds no entry and
has to fall back to a walk - and that is important.

> In addition, we can certainly have a cache that only contains entries for
> changesets touching a file. Given the pattern of filelog traversal, this
> is probably enough to boost adjustlinkrev massively.

IIUC, that's what you mentioned at the sprint - build up a DAG for each
filelog, so once the first (expensive) introrev lookup is done, the
following parent lookups are cheap. <- This is approach P.

But we want to make the first introrev lookup fast as well, so we need the
version without "changerev" anyway. This kind of query is even more
important in the fastannotate use-case, because once you find some revision
stored by fastannotate (ideally right after the first introrev lookup), you
don't even need to look up parents.
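
To make that concrete, here is a rough Python sketch of how a complete
"(path, fnode) -> [linkrev]" cache could short-circuit adjustlinkrev;
"cache" and "isancestor" are hypothetical stand-ins, not real Mercurial
APIs:

  def adjusted_linkrev(cache, isancestor, path, fnode, srcrev):
      # every changelog rev that could have introduced (path, fnode);
      # the cache is assumed to be complete
      candidates = cache[(path, fnode)]
      if len(candidates) == 1:
          # unique candidate: skip the changelog.i ancestor test entirely
          return candidates[0]
      for rev in sorted(candidates, reverse=True):
          if isancestor(rev, srcrev):  # changelog.i ancestor test
              return rev
      # with a complete cache, we never fall back to a changelog.d walk
      raise LookupError('no candidate linkrev is an ancestor of srcrev')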

> > I didn't say the current approach I'm taking is the unique / best way to
> > address the adjustlinkrev perf problem. There is a possibility that we
> > can have O(1)-like adjustlinkrev at a space cost comparable to summing up
> > all .i files.
> >
> > That approach is to make linkrev unique. Since we cannot change the hash
> > function (a BC break), we can store new file nodes which take the commit
> > hash as a salt (and new manifests) in parallel. If we use tree manifests,
> > the space usage will be somewhat feasible. I didn't go this way because
> > it's more complex (it may require inventing some new formats) and depends
> > on tree manifest, which is still "experimental" internally.
> 
> I really don't think having to salt all hashes because of some internal
> implementation detail is a good way to go; salting all filenodes would
> have a serious impact, especially when commit rewriting comes into play.

I forgot to mention: the new file nodes do not contain file content. Their
data is basically a "blobid" (if we can have a "blob" layer like git, or an
"old filenode id") pointing to existing contents. So re-hashing all of them
is not slow even when giant files are touched.
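
Purely as an illustration (the actual format would need design work), a
salted node could be the hash of a small pointer record, so re-hashing
touches only a few bytes per file:

  import hashlib

  def salted_filenode(commithash, old_filenode):
      # the new node's "data" is just a pointer to the existing content
      # (a blobid / old filenode id), never the file content itself
      data = b'blobid:' + old_filenode
      # the commit hash acts as the salt, making the node unique per commit
      return hashlib.sha1(commithash + data).digest()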

If you think about it, the number of new salted file nodes is <= the number
of database rows in approach P - for every "new file node", there is one
(or more, depending on the db schema) row in the database you described. The
beautiful part of the "new file nodes" approach is that, combined with the
tree manifest structure, we could answer the "looking up b::d" question
efficiently - which is not doable with a traditional database.

> > On "exchange":
> >
> > The current ((path, fnode) -> [linkrev]) map could also be exchanged -
> > but there is no immediate plan for that right now.
> >
> >>> But that does not speed up the changelog.d walk. And I suspect that the
> >>> use of _ancestrycontext could lead to incorrect results in some cases if
> >>> the same object is shared across the chain.
> >>
> >> The changelog.d walk is about looking for modified files, right?
> >
> > Yes. The changelog.d walk happens if none of the linkrev candidates is an
> > ancestor of "srcrev". It's basically what git does all the time (one of
> > the key reasons git does not scale currently, imo).
> 
> Does git have a "files touched" field too? Or do they need to look at the
> manifest/tree?

It does not have any optimization like linkrev or the files list in the
changelog. It stores the minimal information. So you can think of it as
doing the "changelog.d" walk and looking up tree manifests (tree objects,
as they call them) all the time. Yet it's not that slow because it's
written in C.
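
Roughly, for every commit it visits, git has to do something like the
following (a simplified sketch; "get_entry" stands in for a real tree
lookup, and none of these names are actual git internals):

  def commit_touched_path(commit, path, get_entry):
      # get_entry(tree, path) -> blob id, or None if the path is absent
      entry = get_entry(commit.tree, path)
      if not commit.parents:
          return entry is not None
      # touched if the blob differs from (or is absent in) some parent
      return any(get_entry(p.tree, path) != entry for p in commit.parents)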