Re: New revlog format, plan page

Josef 'Jeff' Sipek Thu, 07 Jan 2021 09:10:09 -0800

On Tue, Jan 05, 2021 at 19:33:36 +0100, Joerg Sonnenberger wrote:
> On Tue, Jan 05, 2021 at 04:38:20PM +0100, Raphaël Gomès wrote:
> > I've opened a very much draft plan page [1] to try to list all the things we
> > want to do in that version and try to figure out an efficient new format.
> 
> "No support for hash version"
> 
> I don't think that point really matters. The plan for the hash
> migration allows them in theory to coexist fully on the revlog layer and
> the main problems for mixing them are on the changeset/manifest layer
> anyway. That is, any migration strategy will IMO rewrite all revlogs to
> the newer hash anyway and only keep a secondary index for changesets and
> maybe manifests.


At the same time, I think it is sensible (and very useful when looking at a
revlog without repo-level info) for revlogs to identify which hash they
contain, either in some sort of revlog header or in each entry (if the hash
can vary between entries).

> "No support for sidedata"
> 
> My big design level concern is that revlog ATM is optimized for fast
> integer indexing and append-only storage.

This is an interesting point.  What *are* the most common revlog operations?
It probably varies between repos, but I suspect that they are mostly reads
rather than writes.  As a consequence, a good revlog format would optimize
for the common case (without making the less common cases completely suck).

> At least for some sidedata use cases I have, that is an ill fit.

I actually have no idea what sidedata is, but I don't think it changes my
point about picking formats that match the workload :)

> "No support for unified revlog"
> 
> IMO this should be the driving feature.

Agreed (assuming that 'unified revlog' is just a placeholder name for 'a
storage scheme that uses fewer than O(n) files to store revision data').  I
always think twice before I move a file in an hg repo because I don't like
wasting disk space.  It's a stupid feeling, I know.

> The biggest issue for me is that
> it creates two challenges that didn't exist so far:
> (1) Inter-file patches and how they interact with the wire protocol
> (2) Identical revisions stored in different places.
> 
> "No support for larger files"
> 
> Supporting large revlog files is sensible and having a store for
> design-challenged file systems might be necessary. Microsoft, I'm
> looking at you. Otherwise the concern is space use in the revlog file
> and RAM use during operations. I don't think the latter is as big an
> issue now as it was 15 years ago, but the former is real. But it might
> be a good point in time to just go for 64bit offsets by default...

I'd *strongly* advocate for 64-bit offsets.  They pretty much let you forget
that there is a limit.  Storage is cheap.

If revlog entry size is a concern (e.g., it takes more than 1% of the size
of the data it is tracking), then maybe a variable encoding would be the way
to go.

hg already makes use of CBOR, so it'd be reasonable to use it here - either
for the whole entry or just for parts of it.  For example, CBOR's integers
are encoded as a 1-byte type (which itself holds values below 24), followed
by a 0, 1, 2, 4, or 8 byte integer.  Smaller values use less space: values
less than 2^32 use 1-5 bytes.
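
To make the sizes concrete, here is a minimal sketch of CBOR's unsigned
integer encoding (major type 0) in plain Python.  The function name
cbor_encode_uint is made up for illustration; this is not hg's CBOR code and
not a proposed revlog entry layout.

```python
import struct


def cbor_encode_uint(n):
    """Encode a non-negative integer as a CBOR major type 0 item.

    Hypothetical helper for illustration only.
    """
    if n < 0:
        raise ValueError("only unsigned integers are handled here")
    if n < 24:
        # The value fits in the low 5 bits of the initial byte.
        return bytes([n])
    if n < 1 << 8:
        return b"\x18" + struct.pack(">B", n)   # 1-byte argument
    if n < 1 << 16:
        return b"\x19" + struct.pack(">H", n)   # 2-byte argument
    if n < 1 << 32:
        return b"\x1a" + struct.pack(">I", n)   # 4-byte argument
    if n < 1 << 64:
        return b"\x1b" + struct.pack(">Q", n)   # 8-byte argument
    raise ValueError("value does not fit in 64 bits")


if __name__ == "__main__":
    for value in (0, 23, 24, 255, 256, 65535, 65536, 2**32 - 1, 2**32):
        print(value, len(cbor_encode_uint(value)))
```

With this scheme a 64-bit offset costs 9 bytes only once the value reaches
2^32; small offsets stay at 1-3 bytes.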

A common alternative is LEB128 [1], which IIRC is used by git for something
internally.  It is however a bit more expensive to pack/unpack.
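
For comparison, a minimal sketch of unsigned LEB128 in plain Python.  The
function names are made up for illustration, and this is the textbook
encoding described at [1], not necessarily the exact variant any particular
tool uses.

```python
def uleb128_encode(n):
    """Encode a non-negative integer as unsigned LEB128 (illustrative helper)."""
    if n < 0:
        raise ValueError("only unsigned integers are handled here")
    out = bytearray()
    while True:
        byte = n & 0x7F          # take the low 7 bits
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow: set the high bit
        else:
            out.append(byte)         # last byte: high bit clear
            return bytes(out)


def uleb128_decode(data, pos=0):
    """Decode one unsigned LEB128 value; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


if __name__ == "__main__":
    for value in (0, 127, 128, 624485, 2**32 - 1, 2**64 - 1):
        encoded = uleb128_encode(value)
        assert uleb128_decode(encoded) == (value, len(encoded))
        print(value, len(encoded))
```

The per-byte loop is where the extra pack/unpack cost comes from: each byte
carries 7 bits of payload, so a full 64-bit value takes 10 bytes, versus 9
with CBOR's fixed-width arguments.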

Jeff.

[1] https://en.wikipedia.org/wiki/LEB128
