# HG changeset patch # User Gregory Szorc <gregory.sz...@gmail.com> # Date 1494911192 25200 # Mon May 15 22:06:32 2017 -0700 # Node ID 36035e27aeba572c06d450310ac6b041169b3f5e # Parent 49621512d133723fcf54d9c5cd865a40a81446e9 internals: define revlog version 2
(This is an RFC. I think I have a compelling enough reason to justify revlog v2 for zstd. But if we go this route, we should - as Augie says - drive a truck through the opening.) Today, revlogs define the encoding of revision data using an inline 1 byte header on each revision. This solution is nice because it is extensible. But that extensibility comes at a cost: performance. In order to decode the raw revision data, you need to first read it. The way revlog reading is implemented (in Python) today is we iterate through raw chunks, peek at the first byte, then invoke a decompressor (if necessary). For a single threaded reader implemented in Python, this is arguably sufficient. But in a future where we have multi-threaded readers - possibly implemented in other languages - this can introduce problems. There are 2 intertwined problems: practically interfacing with a multi-threaded reading API and performance. As a thought experiment, how would you implement an efficient multi-threaded revlog reader today? To implement a high performance multi-threaded reader, you need to perform CPU bound work outside of Python and outside of the GIL. The CPU bound work for revlog reading is decompression. So, you need a way to perform multi-threaded decompression outside the GIL and Python. If we have compression stored inline in revision data, you have 2 choices: 1) Teach the non-Python code about revision headers and how to handle them. 2) Load revision data and peek at the first byte *before* you call into the non-Python/multi-threaded code so the decoding API only sees data in a specific format. (This allows e.g. calling a multi-decompress function for zlib, zstd, etc.) In #1, we end up duplicating revlog "parsing" logic in our non-Python API. If you are implementing a full revlog reader, fine. But if all you are trying to do is something like offload decompression to a specialized multi-threaded API, this is unwanted complexity. #2 gets the job done, but at a cost of sacrificed performance potential. Today, by the time Python code calls into a decompressor, revlog data is already read() from a file handle into memory. So, peeking at the first byte is pretty cheap. But this I/O pattern undermines a future optimization: memory mapped files. Instead of pre-buffering a file handle into memory, what if we instead memory mapped the revlog. In this world, actual I/O is deferred until first access. If encoding is stored inline, peeking at the first byte will sparsely page the revlog into memory. This is sub-optimal because the loading is sparse (as opposed to sequential reads for the full data you want) and is likely limited to a single thread (Python). A solution to these deficiencies is to store the encoding state of each revision out-of-band with the revision data. If you know the encoding of each revision's data before actually reading the revision data, you can: * Filter/group data before loading it then call Call domain-specific, multi-decode/decompress APIs for specific formats. (This makes implementations easier since they don't need to "parse" data headers.) * Memory map revlogs and use multi-threaded I/O *as part of the multi-threaded decode API* using optimal I/O access patterns. You may think that optimizing for memory-mapped based multi-threaded I/O is premature optimization. But consider that modern decompression formats like lz4 and zstd can decode at several hundred MB/s at the input end. And further consider that modern SSDs only achieve their "sticker" read performance at queue depths greater than 1. In other words, you need to have multiple read requests to the storage layer in flight in order to achieve peak read throughput. And a single thread threading a file descriptor won't be able to do that like multiple threads triggering paging from a memory mapped file will. So deferring I/O to a multi-threaded reader will make I/O less of a bottleneck. And to prove these kinds of performance optimizations are possible, python-zstandard has a multi-threaded decompress API. You pass it a special data structure representing a memory segment and an array of (offset, length) tuples and it fans out to multiple threads to do compression. Mercurial "just" needs to ensure it is passing zstd data. With this thought experiment completed and us having arrived at a justification for declaring the encoding of revision data out-of-band from the data itself, this commit introduces a new revlog format: version 2. The highlight of revlog v2 (so far) is a per-revlog feature flag denoting the default revision encoding format and per-revision flags denoting the encoding strategy used. The per-revision flags are stored in the index, which (at least in current implementations) is already loaded into memory at revlog open time. In revlog v2, there is no need to perform I/O or peek at revision data to determine which encoding a specific revision is using. Instead, when you want to decode multiple revisions you can scan the already-loaded index and group revisions by encoding then dispatch to a multi-threaded reader (which does I/O and decoding) *without having to implement revlog-specific knowledge in the reader*. The design of the flags assumes that individual revlogs will be homogonous in terms of revision encoding: a revision will either be encoded with the default encoding or not encoded. This reflects the reality that some revisions should be stored raw because it makes no sense to encode them. For example, some small inputs are larger compressed than raw. We shouldn't waste bytes to store these compressed and burn CPU to decompress them. So, we have a per-revision flag indicating indicating whether the default encoding is used to facilitate this common case. And because extensibility is nice, we also have a per-revision flag indicating whether the encoding is stored inline (like in existing revlog formats). This encoding is not ideal from a performance standpoint, but it does facilitate experimentation in extensions land. For default encoding formats, I added "not encoded" to facilitate some special cases. For example, a server operator may wish to disable revlog compression to facilitate faster reads. We could do this without a revlog flag. But I think representing "I created this revlog with the intent to never compress its entries" is something worth expressing in the header, as this this tells all future consumers to write raw data without relying on a config option to force that behavior. I added zstd as an encoding. This should not be too controversial. We know we want to support zstd in revlogs. zstd revlogs necessitate a new repo requirement and new repo requirements mean you can invent new formats since you are locking out legacy readers. So it seems acceptable to invent a more optimal revlog version 2 while we're here. I also added lz4. Facebook has an lz4 revlog extension. And lz4 in revlogs makes a *lot* of sense. While I have no concrete plans to add lz4 support to core Mercurial at this time (like I do with zstd), I think we should reserve lz4 a seat at the table. Note that in its current design, we /could/ shoehorn encoding related per-revision flags into revlog version 1. However, it is a bit constraining. You would need to design this such that a per-revision flag indicated the new behavior (such as to be compatible with the existing format). Then, you would need to define the encoding format outside of the revlog, such as in a repo requirements file. While certainly possible, I think this places too many constraints on operations. For example, we may want to encode changelogs differently from manifests differently from filelogs. We may even wish to encode different filelogs differently! Having the default encoding in the revlog itself buys you the flexibility to experiment with these changes easily. TODO * We need to consider how zstd dictionaries (or other per-encoding supplementary data) can be indicated or stored within the revlog. This /could/ be a default encoding feature flag value indicating "zstd w/ dictionary". * Since we're inventing a new revlog format, we should also support hashes that aren't SHA-1. * Nuke the (complex) optimization that the 4 byte header is shared with first revision's index? * Store integers in index as little endian (to facilitate 0-copy parsers)? * Record raw chunk size (so we can pre-allocate exact-sized buffer for decoded/decompressed chunk)? * Record delta chain base explicitly? (Minor perf wins for insertion.) * Since we're inventing a new revlog format, we can also do anything else we want to do (I think marmoute wants to store ancestors count in changelog or something) diff --git a/mercurial/help/internals/revlogs.txt b/mercurial/help/internals/revlogs.txt --- a/mercurial/help/internals/revlogs.txt +++ b/mercurial/help/internals/revlogs.txt @@ -45,6 +45,9 @@ 0 1 RevlogNG (*next generation*). It replaced version 0 when it was implemented in 2006. +2 + Support for default encoding feature flag and per-revision encoding flags. + Available in Mercurial 4.3 (released August 2017). The feature flags short consists of bit flags. Where 0 is the least significant bit, the following bit offsets define flags: @@ -53,8 +56,18 @@ 0 Store revision data inline. 1 Generaldelta encoding. +2-4 + Default revision data encoding (version 2 and newer only). -2-15 + 0 Inline encoding. The encoding of each revision is defined inline + in the data using a 1 byte header. + 1 No encoding / raw data. + 2 zlib compression. + 3 zstd compression (includes zstd frame header). + 4 lz4 compression (includes lz4 frame header). (Not supported by + official Mercurial distribution.) + 5-7 Reserved for future use. +5-15 Reserved for future use. The following header values are common: @@ -99,6 +112,10 @@ 6-7 (2 bytes) 2: REVIDX_EXTSTORED revision data is stored externally. + 3: REVIDX_DEFAULT_ENCODING (revlog v2 and later only) revision data is + encoded using the default as specified by the feature flag in the revlog + header. If unset, data is stored raw, without encoding. + 8-11 (4 bytes) Compressed length of revision data / chunk as stored in revlog. @@ -185,11 +202,15 @@ The actual layout of revlog files on dis *store format*. Typically, a ``.i`` file represents the index revlog (possibly containing inline data) and a ``.d`` file holds the revision data. -Revision Entries -================ +Encoding of Revision Entries +============================ -Revision entries consist of an optional 1 byte header followed by an -encoding of the revision data. The headers are as follows: +Revision entries are stored either as raw data or are encoded (often with +compression). + +Revlog versions ``0`` and ``1`` store the encoding format inline with +revision data via a 1 byte header. The recognized header values are as +follows: \0 (0x00) Revision data is the entirety of the entry, including this header. @@ -200,6 +221,17 @@ x (0x78) The 0x78 value is actually the first byte of the zlib header (CMF byte). +Revlog version ``2`` introduces a revlog feature flag to denote the default +encoding for the revlog as well as per-revision flags defining encoding for +each revision. + +Version ``2`` readers **must** consult the ``REVIDX_DEFAULT_ENCODING`` flag +for each revision. If ``REVIDX_DEFAULT_ENCODING`` is set, the default encoding +as specified by the revlog feature flag is used. Otherwise, revision data is +stored raw, sans encoding. If the default encoding of the revlog is ``0`` +(encoding defined inline), revision data is interpreted the same as it is for +revlog versions ``0`` and ``1``. + Hash Computation ================ _______________________________________________ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel