This is mainly to discuss the chg repo preloading idea [1] in more detail.

Perf Numbers

I wrote a hacky prototype [2] that shows significant improvements on
various commands in our repo:

                                Before  After (in seconds)
    chg bookmark                0.40    0.08
    chg log -r . -T '{node}'    0.50    0.08
    chg sl                      0.71    0.27
    chg id                      0.71    0.37

hg-committed (with a 12M obsstore) could benefit from it too, mainly
because the obsstore becomes preloaded:

                                Before  After (in seconds)
    chg bookmark                0.56    0.06
    chg log -r . -T '{node}'    0.56    0.07
    chg log -r . -p             0.86    0.08
    chg sl                      0.84    0.21
    chg id                      0.54    0.06

So I think it would be nice to get a formal implementation upstreamed. It
also makes it easier to solve the remaining perf issues individually, like
building the hidden bitmap or building a serialization format for the
radix tree, although we can also build those later.

Stateful?

"Stateful" here refers to the chg master server process. Currently the chg
master process is stateless: after forking, the client talks to the forked
worker directly, without affecting the master, and the worker does not
talk back to the master either:

    client             master             worker
      | connect() ->     | accept()
      :                  | fork() ----->    |
      | send() -------------------->        | recv()
      # client and worker no longer talk to master

Therefore the master is currently stateless: it can preload extensions,
but nothing about the repo state.

The general direction is to make the worker tell the master what needs to
be preloaded (like repo paths), and to have the master preload it in a
background thread. Then at fork(), the new worker gets the cache for free.

How about just preloading the repo object?

There was a failed experiment: [3]. The repo object depends on too many
side effects and gets invalidated too easily. chg will not run
uisetup()s, so it will become even harder to get a proper repo object.

So what state do we store?

{repopath: {name: (hash, content)}}. For example:

    cache = {
      '/home/foo/repo1': {
        'index': ('hash', changelogindex),
        'bookmarks': ('hash', bookmarks),
        ....
      },
      '/home/foo/repo2': { .... },
      ....
    }

The main ideas here are:

  1) Store the lowest-level objects, like the C changelog index.
     Because higher-level objects could be changed by extensions in
     unpredictable ways. (This is not true in my hacky prototype, though.)

  2) Hash everything. For the changelog, the hash is something like the
     file stat of changelog.i. There must be a strong guarantee that the
     hash matches the content, which could be challenging, but is not
     impossible. I'll cover more details below.

The cache is scoped by repo to make the API simpler and easier to use. It
may be interesting to have some global state too (like passing back
extension paths so the master can import them at runtime).

What's the API?

(This is an existing implementation detail. I'm open to any ideas.)

For example, let's say we want to preload the changelog index (the code is
simplified, so it does not take care of all corner cases).

First, tell chg how to hash and load something:

    from mercurial.chgserver import repopreload

    @repopreload('index')
    def foopreloader(repo):
        # use the size of changelog.i as the hash. note: the hash function
        # must be very fast.
        hash = repo.svfs.stat('00changelog.i').st_size
        # tell chg about the current hash. if hash matches, the generator
        # function stops here.
        yield hash

        # if hash mismatches, load the changelog in a slower way.
        with repo.svfs('00changelog.i') as f:
            data = f.read()
            # hash what was actually read, not the file object
            hash = len(data)
        index = _buildindex(data)
        index.partialmatch('ffff') # force build the radix tree
        # tell chg about the loading result and the hash that
        # absolutely matches the result.
        yield index, hash

Then, repo.chgcache['index'] becomes available in worker processes. When
initializing the changelog index, try to use the chg cache:

    # inside changelog's revlog.__init__:
    # note: repo.chgcache is empty for non-chg cases, a fallback is needed
    self.index = repo.chgcache.get('index') or _loadindex(...)

The API is the simplest I can think of while still being reasonably
flexible (for example, we can add additional steps like forcing the radix
tree to be built). But I'm open to suggestions.

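To illustrate the two-step generator protocol described above, here is a
minimal, self-contained sketch of how the master side might drive a
preloader. It is not the prototype's code: the registry, the
runpreloaders() helper, and the choice to pass a repo path instead of a repo
object are all assumptions made so that the example runs without Mercurial.

```python
import os

# Hypothetical registry (name -> generator function); not the real chgserver API.
_preloaders = {}

def repopreload(name):
    """Register a preload generator under `name` (illustrative only)."""
    def decorator(func):
        _preloaders[name] = func
        return func
    return decorator

def runpreloaders(cache, repopath):
    """Drive every registered generator once for `repopath`.

    `cache` has the shape {repopath: {name: (hash, content)}}.
    Returns the names whose content was (re)loaded.
    """
    repocache = cache.setdefault(repopath, {})
    reloaded = []
    for name, func in _preloaders.items():
        gen = func(repopath)
        quickhash = next(gen)            # step 1: the cheap hash
        cached = repocache.get(name)
        if cached is not None and cached[0] == quickhash:
            gen.close()                  # hash matches: skip the slow load
            continue
        content, exacthash = next(gen)   # step 2: slow load plus its hash
        gen.close()
        repocache[name] = (exacthash, content)
        reloaded.append(name)
    return reloaded

@repopreload('index')
def indexpreloader(repopath):
    path = os.path.join(repopath, '00changelog.i')
    # cheap hash: the file size, as in the example above
    yield os.stat(path).st_size
    with open(path, 'rb') as f:
        data = f.read()
    # the raw bytes stand in for the real changelog index here;
    # the hash is computed from exactly what was read
    yield data, len(data)
```

Calling runpreloaders(cache, path) a second time without touching the file
loads nothing, because the first yielded hash matches the cached one. Note
that a size-only hash cannot detect a rewrite that keeps the same length,
which is why the guarantee between hash and content matters.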
Some implementation details

(This is the part where I want feedback the most.)

1) IPC between chg master and forked worker

   This is how the worker tells the master what (e.g. repo paths) to
   preload.

   I think it does not need to be 100% reliable, so I use shared memory in
   the hacky prototype [2].

   Pipes are reliable and may notify the master more quickly, but carry a
   risk of blocking. shm seems much easier to implement, and I think the
   latency before the master gets the information is not a big deal.

   I prefer shm, but other ideas are welcome.

2) Side-effect-free repo

   This is the most "painful" part if we want to be confident in the
   preloading framework.

   The preload function has a signature that takes a repo object. But chg
   cannot provide a real repo object, so this is best-effort.

   On the other hand, most preload functions only need to hash file stats,
   i.e. they just use repo.vfs, repo.svfs etc. They do not need a
   full-featured repo object.

   Therefore I think the choices are:

     a) Provide a repo object that is localrepository minus the side
        effects (maximum compatibility)
     b) Build a fresh repo object that has only the minimal parts, e.g.
        vfs, svfs etc. (less compatible)
     c) Do not provide a repo object. Just provide the "repo path" (moves
        the burden to the preload function writers)

   I currently prefer a). The plan is to move part of "localrepository"
   into a side-effect-free "baserepository" and use "baserepository" in
   the background preloading thread.

3) Side effects of extensions

   The new chg framework will not run uisetup()s, but the preloading
   framework does sometimes depend on side effects of extensions'
   uisetup()s. For example, the *manifest extension could greatly change
   the manifest structure, so the default manifest preloading won't work
   as expected.

   I think there are different ways to address this:

     a) Add a top-level "chgsetup()" which only gets called by chg, and is
        meant to be side-effect free.
        Extensions could wrap chg's pseudorepo object, and also register
        their own preloading functions.

        This is actually pretty clean. But I'm not sure whether "chgsetup"
        is a good name, or whether a top-level function is a good idea in
        general.

     b) Have a config option to force chg to load extensions as it does
        today.

        The problem is that uisetup will receive the wrong ui objects,
        which just confuses developers (and is the motivation for the
        ongoing refactoring). It also probably makes chg's logic more
        complex than ideal.

   I cannot think of other good ideas here. Of these two, I prefer a).

The rough plan

I'd like to get the preloading API done first, and then add logic to
preload various things in another hgext.

Preloading the changelog index can be done with confidence, while other
things could be a bit risky because extensions are unpredictable.
Therefore a per-repo config option to control what to preload is
necessary.

[1]: https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-December/091846.html
[2]: https://bpaste.net/show/0dd5889cb453
[3]: https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-March/081615.html