Excerpts from Yuya Nishihara's message of 2017-02-03 00:45:22 +0900:
> On Thu, 2 Feb 2017 09:34:47 +0000, Jun Wu wrote:
> > Perf Numbers
> > 
> >   I wrote a hacky prototype [2] that shows significant improvements on
> >   various commands in our repo:
> >   
> >                                Before   After (in seconds)
> >       chg bookmark             0.40     0.08
> >       chg log -r . -T '{node}' 0.50     0.08
> >       chg sl                   0.71     0.27
> >       chg id                   0.71     0.37
> >   
> >   And hg-committed (with 12M obsstore) could benefit from it too mainly
> >   because the obsstore becomes preloaded:
> >   
> >                                Before   After
> >       chg bookmark             0.56     0.06
> >       chg log -r . -T '{node}' 0.56     0.07
> >       chg log -r . -p          0.86     0.08
> >       chg sl                   0.84     0.21
> >       chg id                   0.54     0.06
> >   
> >   So I think it's nice to get a formal implementation upstreamed. It's also
> >   easier to solve the perf issues individually like building the hidden
> >   bitmap, building an serialization format for the radix tree, etc.
> >   Although we can also build them later.
> 
> Nice. I want this perf number today. :)
> 
> > So what state do we store?
> > 
> >   {repopath: {name: (hash, content)}}. For example:
> > 
> >     cache = {'/home/foo/repo1': {'index': ('hash', changelogindex),
> >                                  'bookmarks': ('hash', bookmarks),
> >                                  .... },
> >              '/home/foo/repo2': { .... }, .... }
> > 
> >   The main ideas here are:
> >     1) Store the lowest level objects, like the C changelog index.
> >        Because higher level objects could be changed by extensions in
> >        unpredictable ways. (this is not true in my hacky prototype though)
> >     2) Hash everything. For changelog, it's like the file stat of
> >        changelog.i. There must be a strong guarantee that the hash matches
> >        the content, which could be challenging, but not impossible. I'll
> >        cover more details below.
> > 
> >   The cache is scoped by repo to make the API simpler/easy to use. It may
> >   be interesting to have some global state (like passing back the extension
> >   path to import them at runtime).
> 
> Regarding this and "2) Side-effect-free repo", can or should we design the API
> as something like a low-level storage interface? That will allow us to not
> make repo/revlog know too much about chg.
> 
> I don't have any concrete idea, but that would work as follows:
> 
>  1. chg injects an object to select storage backends
>     e.g. repo.storage = chgpreloadable(repo.storage, cache)
>  2. repo passes it to revlog, etc.
>  3. revlog uses it to read indexfile, where in-memory cache may be returned
>     e.g. storage.parserevlog(indexfile)
>
> Perhaps, this 'storage' object is similar to the one you call 
> 'baserepository'.

I'm not sure if I get the idea (probably not). How does the implementation
in the master server look like?

It feels more like "repo.chgcache" to me and the difference is that the
vanilla hg will be changed to access objects via it (so the interface looks
more consistent).

Things to consider:

  a) Objects being preloaded have dependency - ex. the obsstore depends on
     changelog and other things. The preload functions run in a defined
     order.
  b) The index file is not always a single file, depending on "vfs".
  c) The user may want to control what to preload. For example, if they have
     an incompatible manifest, they could make changelog preloaded, but not
     manifest.
  d) Users can add other preloading items easily, not only just the
     predefined ones.

I think "storage.parserevlog(indexfile)" (treating index separately, without
from a repo object) may have trouble dealing with "a)".

> >   Then, repo.chgcache['index'] becomes available in worker processes. When
> >   initializing the changelog index, try to use the chg cache:
> > 
> >       # inside changelog's revlog.__init__:
> >       # note: repo.chgcache is empty for non-chg cases, a fallback is needed
> >       self.index = repo.chgcache.get('index') or _loadindex(...)
> > 
> >   The API is the simplest that I can think of, while being also reasonably
> >   flexible (for example, we can add additional steps like forcing building
> >   the radix tree etc). But I'm open to suggestions.
> > 
> > Some implementation details
> > 
> >   (This is the part that I want feedback the most)
> > 
> >   1) IPC between chg master and forked worker
> > 
> >      This is how the worker tells the master about what (ex. repo paths) to
> >      preload.
> > 
> >      I think it does not need to be 100% reliable. So I use shared memory in
> >      the hacky prototype [2]. Pipes are reliable and may notify master
> >      quicker, while have risks of blocking. shm seems much easier to
> >      implement and I think the latency of master getting the information is
> >      not a big deal. I prefer shm, but other ideas are welcomed.
> 
> I like using pipes and it won't be complex so long as an IPC message is short
> enough to write atomically (< ~4k). But pipe-vs-shm is merely an 
> implementation
> detail, so we can switch them later if needed.
> 
> >   2) Side-effect-free repo
> > 
> >      This is the most "painful" part for the preloading framework to be
> >      confident.
> > 
> >      The preload function has a signature that takes a repo object. But chg
> >      could not provide a real repo object. So it's best-effort.
> > 
> >      On the other hand, most preload functions only need to hash file stats,
> >      i.e. just use repo.vfs, repo.svfs etc. They do not need a full-featured
> >      repo object.
> > 
> >      Therefore I think the choices are:
> > 
> >       a) Provide a repo object that is: localrepository - side effects
> >          (maximum compatibility)
> >       b) Build a fresh new repo object that has only the minimal part, ex.
> >          vfs, svfs etc. (less compatible)
> >       c) Do not provide a repo object. Just provide "repo path".
> >          (move the burden to the preload function writers)
> > 
> >      I currently prefer a). The plan is to move part of "localrepository" 
> > to 
> >      a side-effect free "baserepository" and use "baserepository" in the
> >      background preloading thread.
> 
> (see the comment above)
> 
> >   3) Side effect of extensions
> > 
> >      The new chg framework will not run uisetup()s. Where the preloading
> >      framework does sometimes depend on some side effects of extensions'
> >      uisetup()s. For example, the *manifest extension could change greatly
> >      about what the manifest structure so the default manifest preloading
> >      won't work as expected.
> > 
> >      I think there are different ways to address this:
> > 
> >       a) Add a top-level "chgsetup()" which only gets called by chg, and is
> >          meant to be side-effect free.
> >          
> >          Extensions could wrap chg's pseudorepo object, and also register
> >          its own preloading functions.
> > 
> >          This is actually pretty clean. But I'm not sure whether "chgsetup"
> >          is a good name or not, or if a top-level function is a good idea in
> >          general.
> > 
> >       b) Have a config option to force chg to load extensions as it does
> >          today.
> > 
> >          The problems are uisetup will accept wrong ui objects, which just
> >          confuses developers (and is the motivation of the ongoing
> >          refactoring). And it probably makes chg's logic more complex than
> >          ideal.
> > 
> >      I cannot think of other good ideas on this. In this situation, I prefer
> >      a).
> 
> (a) seems fine, if chgsetup() is 100% optional so the extension works 
> flawlessly
> without it (i.e. goes through uncached path.)
_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel

Reply via email to