I've wanted to hack on overhauling revlogs and storage for several months now. The more I think about how to go about it and the reasons why I want to do it (mainly performance and scaling), the more I think a more drastic departure from revlogs and the current storage "backend" is needed. There are some great properties of revlogs and the direct-addressable model of the current store, don't get me wrong. But we can only scale that model so far (as I'm sure anyone from Facebook or Google will tell you).
Anyway, overhauling storage is a daunting proposition. It should be terrifying (because it is). I started thinking about how we could facilitate extreme experimentation on things like, say, swapping in a new storage backend. Perhaps even one implemented in Rust :)

Durham did a lot of work on the manifest code a few months back. Essentially, he successfully decoupled the API of the manifest from the low-level implementation of a revlog. It even abstracts away flat versus tree manifests. It's great stuff. I was reminded of his work when I started trying to do something similar for revlogs.

Long story short, one thing led to another and I had this crazy idea that it would be a good idea to declare formal interfaces for important constructs, like the changelog, manifests, file history, peers, and quite possibly the repo object itself. If we did this and somehow enforced the interface, it would be possible to swap out implementations as needed. For example, if the classic store with 1 file/revlog per tracked path doesn't scale for you because you have >1M files, then you can swap in a store that uses "pack files," remote storage, etc. If a classic filesystem-based working directory doesn't fit the bill, perhaps you swap in one that is virtual filesystem aware.

I wanted to experiment with this idea on something that is easier to reason about than storage. That led me to the "peer" classes (peer.py, httppeer.py, sshpeer.py, etc). I've just submitted a series where I formalize the peer interface using abstract base classes. Read the commit messages for https://phab.mercurial-scm.org/D332 and https://phab.mercurial-scm.org/D339 and the commits in that stack for more details. Developing that series uncovered a number of minor bugs. And I think the end result is a peer API that is more easily understood and easier to hack on. So, I think there is merit to the approach for code maintainability reasons alone.
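
To make the abstract base class idea concrete, here is a minimal sketch of what enforcing an interface looks like in Python. It is not the code from the series above: the class names (`BasePeer`, `DummyPeer`, `BrokenPeer`), the chosen methods, and the `try_instantiate` helper are all hypothetical and exist only for illustration.

```python
import abc


class BasePeer(abc.ABC):
    """Hypothetical peer interface; the method set is illustrative only."""

    @abc.abstractmethod
    def url(self):
        """Return the URL this peer talks to."""

    @abc.abstractmethod
    def capabilities(self):
        """Return the set of capability names this peer supports."""

    def capable(self, name):
        # Shared helper written purely against the abstract methods,
        # so every conforming implementation gets it for free.
        return name in self.capabilities()


class DummyPeer(BasePeer):
    """A trivial but complete implementation of the interface."""

    def __init__(self, url):
        self._url = url

    def url(self):
        return self._url

    def capabilities(self):
        return {"lookup", "branchmap"}


class BrokenPeer(BasePeer):
    """Forgets to implement capabilities(), so it cannot be instantiated."""

    def url(self):
        return "dummy://broken"


def try_instantiate(cls, *args):
    """Return an instance, or the TypeError raised by the ABC machinery."""
    try:
        return cls(*args)
    except TypeError as exc:
        return exc
```

The point of the sketch is that `BrokenPeer` fails at construction time rather than much later when some caller reaches for the missing method, which is the kind of enforcement that makes swapping implementations less fragile.
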
But I want to dream bigger and apply this to more significant primitives (like storage).

If we adopt formal interfaces for important constructs, I'd like to see support in the test harness for swapping in alternate implementations of these things. Jun recently added #testcases syntax to .t tests so we could run multiple variations of the same test. I'd like to do something similar at the entire test suite level. For example, `run-tests.py --changelog=sqlite` or `run-tests.py --store=leveldb`. Or more realistically, `run-tests.py --peers http,ssh` would run the test suite using both the http and ssh peer implementations. I could see that culminating with tests naturally dividing themselves into low-level unit tests (interface implementation specific) and higher-level, generic integration tests. If we "code to the interface," it should be possible to swap in a brand new implementation of something like a peer protocol or changelog and have it pass the integration tests. We could even have dummy, super fast implementations to facilitate hacking on things like frontend features and shorten the edit-test cycle.

Formal interfaces will facilitate extreme experimentation without the traditional fragility that a dynamic language like Python introduces. They will allow us to more clearly define boundaries between components. This will make it vastly easier to refactor and do things like rewrite large components in Rust. It would also make it *much* easier to implement "hgit" (using the Mercurial CLI to interface with a Git repository, advanced features like revsets and all). (Call me crazy, but I'd love to ship this feature as part of Mercurial to help entice new users.)

Regardless of whether we go all in on formal interfaces, there's an interesting idea 2 paragraphs back: tests that are implementation agnostic. Today, we end up duplicating test functionality for minor variations, e.g. http vs ssh vs local peer interactions, or bundle1 vs bundle2.
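
As a rough sketch of what implementation-agnostic tests could look like, the snippet below runs the same behavior checks against every implementation selected by a comma-separated name list, loosely mirroring the shape of the `--peers http,ssh` idea. Everything in it is an assumption made for illustration: the two toy stores, the `IMPLEMENTATIONS` registry, and `run_checks` are hypothetical and have nothing to do with how `run-tests.py` works today.

```python
class DictStore:
    """Hypothetical key/value store backed by a dict."""

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class ListStore:
    """Hypothetical append-only store backed by a list of pairs."""

    def __init__(self):
        self._items = []

    def add(self, key, value):
        self._items.append((key, value))

    def get(self, key):
        # Newest entry wins, so scan from the end.
        for k, v in reversed(self._items):
            if k == key:
                return v
        raise KeyError(key)


# Registry mapping a name given on the command line to a factory.
IMPLEMENTATIONS = {"dict": DictStore, "list": ListStore}


def check_last_write_wins(store):
    store.add(b"a", b"1")
    store.add(b"a", b"2")
    assert store.get(b"a") == b"2"


def check_missing_key_raises(store):
    try:
        store.get(b"missing")
    except KeyError:
        return
    raise AssertionError("expected KeyError for a missing key")


# These checks describe behavior only; none mentions a concrete class.
BEHAVIOR_CHECKS = [check_last_write_wins, check_missing_key_raises]


def run_checks(selected):
    """Run every check against each implementation named in `selected`.

    `selected` is a comma-separated string such as "dict,list". Returns a
    dict mapping (implementation name, check name) to True or False.
    """
    results = {}
    for name in selected.split(","):
        factory = IMPLEMENTATIONS[name]
        for check in BEHAVIOR_CHECKS:
            try:
                check(factory())  # fresh instance per check
                results[(name, check.__name__)] = True
            except AssertionError:
                results[(name, check.__name__)] = False
    return results
```

Calling `run_checks("dict,list")` exercises both stores with the same two checks; a new implementation only has to be added to the registry to get the same coverage.
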
I'd really like to move many tests to Jun's #testcases feature because it will allow us to achieve higher test coverage while writing fewer tests. I think it would be worthwhile to figure out how to consolidate as many tests as possible so they test behavior, not a specific implementation. A side benefit of doing this is that we'll uncover areas where implementations vary in behavior. This will help squash bugs and produce a more consistent user interface and experience, because every time we see different behavior between things performing the same role, we'll ask ourselves why the discrepancy exists.

Anyway, this is probably the craziest Mercurial idea I've had in a while. I'm not sure how much of it is realistic. But I'd certainly like to establish some formality around interfaces for core components to facilitate code comprehension, refactoring, and testing. The work I just submitted on the peer API seems to show potential. I'm just not sure if we can achieve some of the more ambitious goals with the approach I've taken in that series. I'd love to hear what others think.

Gregory