On Thu, 2 Feb 2017 09:34:47 +0000, Jun Wu wrote: > Perf Numbers > > I wrote a hacky prototype [2] that shows significant improvements on > various commands in our repo: > > Before After (in seconds) > chg bookmark 0.40 0.08 > chg log -r . -T '{node}' 0.50 0.08 > chg sl 0.71 0.27 > chg id 0.71 0.37 > > And hg-committed (with 12M obsstore) could benefit from it too mainly > because the obsstore becomes preloaded: > > Before After > chg bookmark 0.56 0.06 > chg log -r . -T '{node}' 0.56 0.07 > chg log -r . -p 0.86 0.08 > chg sl 0.84 0.21 > chg id 0.54 0.06 > > So I think it's nice to get a formal implementation upstreamed. It's also > easier to solve the perf issues individually like building the hidden > bitmap, building an serialization format for the radix tree, etc. > Although we can also build them later.
Nice. I want this perf number today. :) > So what state do we store? > > {repopath: {name: (hash, content)}}. For example: > > cache = {'/home/foo/repo1': {'index': ('hash', changelogindex), > 'bookmarks': ('hash', bookmarks), > .... }, > '/home/foo/repo2': { .... }, .... } > > The main ideas here are: > 1) Store the lowest level objects, like the C changelog index. > Because higher level objects could be changed by extensions in > unpredictable ways. (this is not true in my hacky prototype though) > 2) Hash everything. For changelog, it's like the file stat of > changelog.i. There must be a strong guarantee that the hash matches > the content, which could be challenging, but not impossible. I'll > cover more details below. > > The cache is scoped by repo to make the API simpler/easy to use. It may > be interesting to have some global state (like passing back the extension > path to import them at runtime). Regarding this and "2) Side-effect-free repo", can or should we design the API as something like a low-level storage interface? That will allow us to not make repo/revlog know too much about chg. I don't have any concrete idea, but that would work as follows: 1. chg injects an object to select storage backends e.g. repo.storage = chgpreloadable(repo.storage, cache) 2. repo passes it to revlog, etc. 3. revlog uses it to read indexfile, where in-memory cache may be returned e.g. storage.parserevlog(indexfile) Perhaps, this 'storage' object is similar to the one you call 'baserepository'. > Then, repo.chgcache['index'] becomes available in worker processes. When > initializing the changelog index, try to use the chg cache: > > # inside changelog's revlog.__init__: > # note: repo.chgcache is empty for non-chg cases, a fallback is needed > self.index = repo.chgcache.get('index') or _loadindex(...) > > The API is the simplest that I can think of, while being also reasonably > flexible (for example, we can add additional steps like forcing building > the radix tree etc). But I'm open to suggestions. > > Some implementation details > > (This is the part that I want feedback the most) > > 1) IPC between chg master and forked worker > > This is how the worker tells the master about what (ex. repo paths) to > preload. > > I think it does not need to be 100% reliable. So I use shared memory in > the hacky prototype [2]. Pipes are reliable and may notify master > quicker, while have risks of blocking. shm seems much easier to > implement and I think the latency of master getting the information is > not a big deal. I prefer shm, but other ideas are welcomed. I like using pipes and it won't be complex so long as an IPC message is short enough to write atomically (< ~4k). But pipe-vs-shm is merely an implementation detail, so we can switch them later if needed. > 2) Side-effect-free repo > > This is the most "painful" part for the preloading framework to be > confident. > > The preload function has a signature that takes a repo object. But chg > could not provide a real repo object. So it's best-effort. > > On the other hand, most preload functions only need to hash file stats, > i.e. just use repo.vfs, repo.svfs etc. They do not need a full-featured > repo object. > > Therefore I think the choices are: > > a) Provide a repo object that is: localrepository - side effects > (maximum compatibility) > b) Build a fresh new repo object that has only the minimal part, ex. > vfs, svfs etc. (less compatible) > c) Do not provide a repo object. Just provide "repo path". > (move the burden to the preload function writers) > > I currently prefer a). The plan is to move part of "localrepository" to > a side-effect free "baserepository" and use "baserepository" in the > background preloading thread. (see the comment above) > 3) Side effect of extensions > > The new chg framework will not run uisetup()s. Where the preloading > framework does sometimes depend on some side effects of extensions' > uisetup()s. For example, the *manifest extension could change greatly > about what the manifest structure so the default manifest preloading > won't work as expected. > > I think there are different ways to address this: > > a) Add a top-level "chgsetup()" which only gets called by chg, and is > meant to be side-effect free. > > Extensions could wrap chg's pseudorepo object, and also register > its own preloading functions. > > This is actually pretty clean. But I'm not sure whether "chgsetup" > is a good name or not, or if a top-level function is a good idea in > general. > > b) Have a config option to force chg to load extensions as it does > today. > > The problems are uisetup will accept wrong ui objects, which just > confuses developers (and is the motivation of the ongoing > refactoring). And it probably makes chg's logic more complex than > ideal. > > I cannot think of other good ideas on this. In this situation, I prefer > a). (a) seems fine, if chgsetup() is 100% optional so the extension works flawlessly without it (i.e. goes through uncached path.) _______________________________________________ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel