On 1/7/21 8:52 PM, Pierre-Yves David wrote:


On 1/5/21 7:33 PM, Joerg Sonnenberger wrote:
On Tue, Jan 05, 2021 at 04:38:20PM +0100, Raphaël Gomès wrote:
I've opened a very much draft plan page [1] to try to list all the things we want to do in that version and try to figure out an efficient new format.

"No support for hash version"

I don't think that point really matters. The plan for the hash
migration allows them in theory to coexist fully on the revlog layer and
the main problems for mixing them are on the changeset/manifest layer
anyway. That is, any migration strategy will IMO rewrite all revlogs to
the newer hash anyway and only keep a secondary index for changesets and
maybe manifests.

I agree here, the hash used will likely be defined at repository level (or at least revlog level).


"No support for sidedata"

My big design level concern is that revlog ATM is optimized for fast
integer indexing and append-only storage. At least for some sidedata
use cases I have, that is an ill fit.

The current spirit for sidedata is to have

Looks like this sentence got interru…

The current spirit of sidedata is for it to contain computed data that is inherent to the changeset (or to the revision in general) and can be computed once and for all when the changeset is added.

The storage proposed in revlog v2 requires the data to be added at "revision addition time" but does not require the sidedata to be next to the changeset data. This simplifies operations that need the rest of the changegroup (manifest, filelog) to be added before computation.

It also means one could "update" the sidedata by "simply" rewriting the index.

I am sympathetic to a more generic storage for more volatile data. However, the current proposal is good enough for the current goal and a couple of others, and it is quite simple to implement. So the plan is to go with it for now.
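
To illustrate the "update the sidedata by rewriting the index" idea, here is a minimal sketch. It is not the revlog v2 format: the class, the field names and the in-memory buffers are all hypothetical, and it only assumes that each index entry records where its sidedata lives, separately from the revision data.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class IndexEntry:
    # Hypothetical fields; the real revlog v2 index layout may differ.
    data_offset: int
    data_length: int
    sidedata_offset: int
    sidedata_length: int


class ToyRevlog:
    """Toy append-only store where sidedata is kept apart from revision data."""

    def __init__(self):
        self.index = []              # one IndexEntry per revision
        self.data = bytearray()      # append-only revision data
        self.sidedata = bytearray()  # append-only sidedata

    def add_revision(self, data, sidedata=b""):
        entry = IndexEntry(len(self.data), len(data),
                           len(self.sidedata), len(sidedata))
        self.data += data
        self.sidedata += sidedata
        self.index.append(entry)
        return len(self.index) - 1

    def update_sidedata(self, rev, sidedata):
        # Append the new sidedata, then rewrite only the index entry.
        # The revision data and the old sidedata bytes are left untouched.
        old = self.index[rev]
        self.index[rev] = replace(old,
                                  sidedata_offset=len(self.sidedata),
                                  sidedata_length=len(sidedata))
        self.sidedata += sidedata

    def get_data(self, rev):
        e = self.index[rev]
        return bytes(self.data[e.data_offset:e.data_offset + e.data_length])

    def get_sidedata(self, rev):
        e = self.index[rev]
        end = e.sidedata_offset + e.sidedata_length
        return bytes(self.sidedata[e.sidedata_offset:end])
```

In this toy model the old sidedata bytes stay in the file as dead space, which is one reason a different storage could suit more volatile data better.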

"No support for unified revlog"

IMO this should be the driving feature. The biggest issue for me is that
it creates two challenges that didn't exist so far:
(1) Inter-file patches and how they interact with the wire protocol

I am not worried here. Inter-file patches should be as simple as using a delta base pointing to the content of another file. As for the wire protocol, we are already very bad at dealing with deltas against non-parents, so we should be about as bad.

(2) Identical revisions stored in different places.

The broad plan for the unified revlog is to store things using a pair of identifiers: a content hash (e.g. filenodeid) and a content identifier. The two main options here are:

* using a hash of the target content (taking 32bits, "expensive" to search)
* using some integer identifier and an associated side mapping for the content → ID mapping (plus over-the-wire translation to non-local identifiers).

The second option seems more time and space efficient, so I am leaning toward it.

Either way, I think similar content (i.e. same nodeid) should probably be stored twice in the index to keep the current properties; we can reuse the data segment, however. So uniqueness and indexing would happen using the (nodeid, contentid) pairs.
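
A rough sketch of the second option, to make the (nodeid, contentid) indexing concrete. Everything here is hypothetical (class and method names, in-memory dicts instead of an on-disk index), and it assumes the "content identifier" names the file or other target that a revision belongs to.

```python
class ToyUnifiedIndex:
    """Toy unified index keyed by (nodeid, contentid) pairs."""

    def __init__(self):
        self._content_ids = {}  # target name (e.g. file path) -> integer ID
        self._entries = {}      # (nodeid, contentid) -> (offset, length)
        self._segments = {}     # nodeid -> (offset, length) of stored data
        self.data = bytearray()

    def content_id(self, name):
        # Side mapping: hand out small local integers on first use.
        return self._content_ids.setdefault(name, len(self._content_ids))

    def add(self, nodeid, name, data):
        key = (nodeid, self.content_id(name))
        if key in self._entries:
            return key
        segment = self._segments.get(nodeid)
        if segment is None:
            # First time this nodeid is seen: write the data once.
            segment = (len(self.data), len(data))
            self.data += data
            self._segments[nodeid] = segment
        # Same nodeid under another target: new index entry, same segment.
        self._entries[key] = segment
        return key

    def read(self, nodeid, name):
        offset, length = self._entries[(nodeid, self._content_ids[name])]
        return bytes(self.data[offset:offset + length])


idx = ToyUnifiedIndex()
idx.add(b"node-1", "a.txt", b"hello")
idx.add(b"node-1", "b.txt", b"hello")  # second index entry, no new data
```

The integer IDs are local to one repository in this sketch, which is why the second option also needs a translation step over the wire.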



"No support for larger files"

Supporting large revlog files is sensible and having a store for
design-challenged file systems might be necessary. Microsoft, I'm
looking at you. Otherwise the concern is space use in the revlog file
and RAM use during operations. I don't think the latter is as big an
issue now as it was 15 years ago, but the former is real. But it might
be a good point in time to just go for 64bit offsets by default...

Right now, offsets are 6 bytes, so we can use a revlog.d of up to 281 TB, which seems good enough. The main "limitation" is about the file content, currently limited to 4GB. Given that we hold these in RAM for now, I don't think we need to bump it. We can bump it when introducing smarter RAM handling for such files.
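
For reference, the arithmetic behind those two numbers can be checked quickly. The 4-byte length field is an assumption made here to account for the 4GB limit; the constant names are only illustrative.

```python
OFFSET_BYTES = 6  # offset width mentioned above
LENGTH_BYTES = 4  # assumption: a 32-bit length field behind the 4GB limit

max_offset = 2 ** (8 * OFFSET_BYTES)  # addressable bytes in revlog.d
max_length = 2 ** (8 * LENGTH_BYTES)  # largest single file content

print(max_offset)             # 281474976710656 bytes
print(max_offset / 10 ** 12)  # roughly 281 TB (decimal terabytes)
print(max_length / 2 ** 30)   # 4.0 GiB
```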


--
Pierre-Yves David
_______________________________________________
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
