On Tue, Jan 05, 2021 at 19:33:36 +0100, Joerg Sonnenberger wrote:
> On Tue, Jan 05, 2021 at 04:38:20PM +0100, Raphaël Gomès wrote:
> > I've opened a very much draft plan page [1] to try to list all the things we
> > want to do in that version and try to figure out an efficient new format.
>
> "No support for hash version"
>
> I don't think that points really matters. The plan for the hash
> migration allows them in theory to coexist fully on the revlog layer and
> the main problems for mixing them are on the changeset/manifest layer
> anyway. That is, any migration strategy will IMO rewrite all revlogs to
> the newer hash anyway and only keep a secondary index for changesets and
> maybe manifests.

At the same time, I think it is sensible (and very useful when looking at a
revlog without repo-level info) for revlogs to identify which hash they
contain, either in some sort of revlog header or in each entry (if the hash
can vary between entries).

> "No support for sidedata"
>
> My big design level concern is that revlog ATM is optimized for fast
> integer indexing and append-only storage.

This is an interesting point. What *are* the most common revlog operations?
It probably varies between repos, but I suspect that they are mostly reads
rather than writes. As a consequence, a good revlog format would optimize
for the common case (without making the less common cases completely suck).

> At least for some sidedata use cases I have, that is an ill fit.

I actually have no idea what sidedata is, but I don't think it changes my
point about picking formats that match the workload :)

> "No support for unified revlog"
>
> IMO this should be the driving feature.

Agreed (assuming that 'unified revlog' is just a placeholder name for 'a
storage scheme that uses fewer than O(n) files to store revision data'). I
always think twice before I move a file in a hg repo because I don't like
wasting disk space. It's a stupid feeling, I know.

> The biggest issue for me is that
> it creates two challenges that didn't exist so far:
> (1) Inter-file patches and how they interact with the wire protocol
> (2) Identical revisions stored in different places.
>
> "No support for larger files"
>
> Supporting large revlog files is sensible and having a store for
> design-challenged file systems might be necessary. Microsoft, I'm
> looking at you. Otherwise the concern is space use in the revlog file
> and RAM use during operations. I don't think the latter is as big an
> issue now as it was 15 years ago, but the former is real. But it might
> be a good point in time to just go for 64bit offsets by default...

I'd *strongly* advocate for 64-bit offsets.
They pretty much let you forget that there is a limit. Storage is cheap. If
revlog entry size is a concern (e.g., it takes more than 1% of the size of
the data it is tracking), then maybe a variable-length encoding would be the
way to go.

hg already makes use of CBOR, so it'd be reasonable to use here - either for
the whole entry or just for parts of it. For example, CBOR's integers are
encoded as a 1-byte type, followed by a 0, 1, 2, 4, or 8 byte integer.
Smaller values use less space. For example, values less than 2^32 use 1-5
bytes.

A common alternative is LEB128 [1], which IIRC is used by git for something
internally. It is however a bit more expensive to pack/unpack.

Jeff.

[1] https://en.wikipedia.org/wiki/LEB128

_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
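
To make the size comparison between the two variable-length encodings
concrete, here is a minimal Python sketch of a CBOR unsigned-integer encoder
(major type 0) and an unsigned LEB128 encoder. The helper names cbor_uint
and uleb128 are hypothetical, chosen for illustration only; they are not
Mercurial APIs, and nothing here reflects how a revlog entry would actually
be laid out.

```python
import struct


def cbor_uint(n: int) -> bytes:
    """Encode n as a CBOR unsigned integer (hypothetical helper).

    The initial byte holds the value itself for 0..23, or a marker
    (24, 25, 26, 27) saying that 1, 2, 4 or 8 big-endian bytes follow.
    """
    if n < 0:
        raise ValueError("unsigned values only")
    if n < 24:
        return bytes([n])
    if n < 1 << 8:
        return b"\x18" + struct.pack(">B", n)
    if n < 1 << 16:
        return b"\x19" + struct.pack(">H", n)
    if n < 1 << 32:
        return b"\x1a" + struct.pack(">I", n)
    if n < 1 << 64:
        return b"\x1b" + struct.pack(">Q", n)
    raise ValueError("value does not fit in 64 bits")


def uleb128(n: int) -> bytes:
    """Encode n as unsigned LEB128 (hypothetical helper).

    Emits 7 bits per byte, least significant group first; the high
    bit of each byte is set when more bytes follow.
    """
    if n < 0:
        raise ValueError("unsigned values only")
    out = bytearray()
    while True:
        group = n & 0x7F
        n >>= 7
        if n:
            out.append(group | 0x80)
        else:
            out.append(group)
            return bytes(out)


if __name__ == "__main__":
    # Print the encoded length of a few sample offsets in each scheme.
    for value in (0, 23, 24, 255, 65535, 2**32 - 1, 2**32, 2**64 - 1):
        print(value, len(cbor_uint(value)), len(uleb128(value)))
```

With these definitions, any value below 2^32 takes 1 to 5 bytes in CBOR, and
a full 64-bit offset takes 9 bytes in CBOR versus 10 in LEB128. The LEB128
loop touches every 7-bit group, while the CBOR path is a size check plus one
fixed-width pack, which is one way to read the pack/unpack cost difference
mentioned above.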