# HG changeset patch # User Gregory Szorc <gregory.sz...@gmail.com> # Date 1483401236 28800 # Mon Jan 02 15:53:56 2017 -0800 # Node ID cd8f8b1b8dbae0cf2be79a732a936f4105881672 # Parent 96837453d041fe7289dd3e6450aee4961517c5b5 [RFC] localrepo: support non-zlib compression engines in revlogs
The final part of integrating the compression manager APIs into revlog storage is the plumbing for repositories to advertise they are using non-zlib storage and for revlogs to instantiate a non-zlib engine. Introduced is a format.compengine config option to control the compression engine to use. It defaults to zlib, which is backwards compatible. localrepo.newreporequirements() looks at this config option and adds a "compengine*" requirement on the named compression engine if it isn't "zlib." Doing it at this layer means that existing repositories will forever create new data as the engine they were initially created with: only new repositories get the new compression engine. The svfs opener options now define a "compengine" key denoting the preferred compression engine. This comes from the repo's active requirements. A localrepository instance now updates its supported requirements to include "compengine*" values for compression engines registered with the compression engine manager. Finally, revlog instances have been taught to instantiate non-zlib compression options if the opener options instruct it to. This patch means that we can use zstd compression in revlogs! On an `hg pull` conversion of the mozilla-unified repo with no extra redelta settings (like aggressivemergedeltas), we can see the impact of zstd vs zlib in revlogs: $ hg perfrevlogchunks -c ! chunk ! wall 2.032052 comb 2.040000 user 1.990000 sys 0.050000 (best of 5) ! wall 1.866360 comb 1.860000 user 1.820000 sys 0.040000 (best of 6) ! chunk batch ! wall 1.877261 comb 1.870000 user 1.860000 sys 0.010000 (best of 6) ! wall 1.705410 comb 1.710000 user 1.690000 sys 0.020000 (best of 6) $ hg perfrevlogchunks -m ! chunk ! wall 2.721427 comb 2.720000 user 2.640000 sys 0.080000 (best of 4) ! wall 2.035076 comb 2.030000 user 1.950000 sys 0.080000 (best of 5) ! chunk batch ! wall 2.614561 comb 2.620000 user 2.580000 sys 0.040000 (best of 4) ! wall 1.910252 comb 1.910000 user 1.880000 sys 0.030000 (best of 6) $ hg perfrevlog -c -d 1 ! wall 4.812885 comb 4.820000 user 4.800000 sys 0.020000 (best of 3) ! wall 4.699621 comb 4.710000 user 4.700000 sys 0.010000 (best of 3) $ hg perfrevlog -m -d 1000 ! wall 34.252800 comb 34.250000 user 33.730000 sys 0.520000 (best of 3) ! wall 24.094999 comb 24.090000 user 23.320000 sys 0.770000 (best of 3) Only modest wins for the changelog. But manifest reading is significantly faster. I presume this is because manifest chunks tend to be larger and zstd's performance advantage is more apparent the longer it runs. debugcreatestreamclonebundle (size in bytes) zlib: 1,638,852,492 zstd: 1,680,601,332 I'm not sure why zstd on disk is larger. It could be overhead of its framing headers. It could also be poor delta parent choices, as I didn't force redelta on the zstd conversion of the repo. $ hg bundle -a -t none-v2 (total CPU time) zlib: 102.79s zstd: 97.75s So, marginal CPU decrease for reading all chunks in all revlogs (this is somewhat disappointing). $ hg bundle -a -t <engine>-v2 (total CPU time) zlib: 191.59s zstd: 115.36s This last test effectively measures the difference between zlib->zlib and zstd->zstd for revlogs to bundle. This is a rough approximation of what a server does during `hg clone`. Also, I haven't exhaustively tested and profiled various zstd APIs for decompression. The C bindings expose a handful of ways to perform decompression and it's quite possible there are some performance optimizations on the table there. FWIW, I have a local patch series where I enable support for lz4 revlogs using the same lz4 C extension used by Facebook's lz4revlogs extension and surprisingly zstd's decompress speed is faster than lz4! This shouldn't be true because, well, it is assumed lz4 decompress is insanely fast. I suspect this has more to do with the lz4 C extension doing something silly, like reinitializing contexts. But that's a rabbit hole for another day. --- Things to hash out making this an RFC quality patch: * Is using a wildcard in the requirement name acceptable? Perhaps we should use e.g. compengine=zstd instead? * Does the requirement need to be specific to revlogs? * Should the requirement name be explicitly declared by the engine instead of inferred by the engine name? (A extra layer of abstraction may help here if a single engine ever supports multiple requirements, like the zstd one will if dictionary compression is added.) * Do more data due diligence and record in commit message. * stream clone bundle requirements testing * More tests around revlog format (verification of zstd frame header) diff --git a/mercurial/help/config.txt b/mercurial/help/config.txt --- a/mercurial/help/config.txt +++ b/mercurial/help/config.txt @@ -687,6 +687,11 @@ Example for ``~/.hgrc``:: ``format`` ---------- +``compengine`` + Compression engine to use for storage. + + Default: ``zlib`` + ``usegeneraldelta`` Enable or disable the "generaldelta" repository format which improves repository compression by allowing "revlog" to store delta against arbitrary diff --git a/mercurial/help/internals/requirements.txt b/mercurial/help/internals/requirements.txt --- a/mercurial/help/internals/requirements.txt +++ b/mercurial/help/internals/requirements.txt @@ -106,3 +106,12 @@ one manifest per directory (as opposed t Support for this requirement was added in Mercurial 3.4 (released August 2015). The requirement is currently experimental and is disabled by default. + +compengine* +=========== + +Denotes a compression engine required to open files in the repository. +``*`` is a compression engine name, such as ``zstd``. + +Support for this requirement was added in Mercurial 4.1 (released +February 2017). diff --git a/mercurial/localrepo.py b/mercurial/localrepo.py --- a/mercurial/localrepo.py +++ b/mercurial/localrepo.py @@ -284,6 +284,12 @@ class localrepository(object): else: self.supported = self._basesupported + # Add compression engines. + for name in util.compengines: + engine = util.compengines[name] + if engine.revlogheader: + self.supported.add('compengine%s' % name) + if not self.vfs.isdir(): if create: self.requirements = newreporequirements(self) @@ -397,6 +403,10 @@ class localrepository(object): self.svfs.options['aggressivemergedeltas'] = aggressivemergedeltas self.svfs.options['lazydeltabase'] = not scmutil.gddeltaconfig(self.ui) + for r in self.requirements: + if r.startswith('compengine'): + self.svfs.options['compengine'] = r[len('compengine'):] + def _writerequirements(self): scmutil.writerequires(self.vfs, self.requirements) @@ -1994,6 +2004,17 @@ def newreporequirements(repo): if ui.configbool('format', 'dotencode', True): requirements.add('dotencode') + compengine = ui.config('format', 'compengine', 'zlib') + if compengine not in util.compengines: + raise error.Abort(_('compression engine %s defined by ' + 'format.compengine not available') % compengine, + hint=_('run "hg debuginstall" to list available ' + 'compression engines')) + + # zlib is the historical default and doesn't need an explicit requirement. + if compengine != 'zlib': + requirements.add('compengine%s' % compengine) + if scmutil.gdinitconfig(ui): requirements.add('generaldelta') if ui.configbool('experimental', 'treemanifest', False): diff --git a/mercurial/revlog.py b/mercurial/revlog.py --- a/mercurial/revlog.py +++ b/mercurial/revlog.py @@ -228,6 +228,7 @@ class revlog(object): # Mapping of revision integer to full node. self._nodecache = {nullid: nullrev} self._nodepos = None + self._compengine = 'zlib' v = REVLOG_DEFAULT_VERSION opts = getattr(opener, 'options', None) @@ -244,6 +245,8 @@ class revlog(object): if 'aggressivemergedeltas' in opts: self._aggressivemergedeltas = opts['aggressivemergedeltas'] self._lazydeltabase = bool(opts.get('lazydeltabase', False)) + if 'compengine' in opts: + self._compengine = opts['compengine'] if self._chunkcachesize <= 0: raise RevlogError(_('revlog chunk cache size %r is not greater ' @@ -301,7 +304,7 @@ class revlog(object): @util.propertycache def _compressor(self): - return util.compengines['zlib'].revlogcompressor() + return util.compengines[self._compengine].revlogcompressor() def tip(self): return self.node(len(self.index) - 2) diff --git a/tests/test-help.t b/tests/test-help.t --- a/tests/test-help.t +++ b/tests/test-help.t @@ -1223,6 +1223,8 @@ Separate sections from subsections "format" -------- + "compengine" + "usegeneraldelta" "dotencode" diff --git a/tests/test-repo-compengines.t b/tests/test-repo-compengines.t new file mode 100644 --- /dev/null +++ b/tests/test-repo-compengines.t @@ -0,0 +1,78 @@ +A new repository uses zlib storage, which doesn't require a requirement + + $ hg init default + $ cd default + $ cat .hg/requires + dotencode + fncache + generaldelta + revlogv1 + store + + $ touch foo + $ hg -q commit -A -m 'initial commit with a lot of repeated repeated repeated text to trigger compression' + $ hg debugrevlog -c | grep 0x78 + 0x78 (x) : 1 (100.00%) + 0x78 (x) : 110 (100.00%) + + $ cd .. + +Unknown compression engine to format.compengine aborts + + $ hg --config format.compengine=unknown init unknown + abort: compression engine unknown defined by format.compengine not available + (run "hg debuginstall" to list available compression engines) + [255] + +A requirement specifying an unknown compression engine results in bail + + $ hg init unknownrequirement + $ cd unknownrequirement + $ echo compengineunknown >> .hg/requires + $ hg log + abort: repository requires features unknown to this Mercurial: compengineunknown! + (see https://mercurial-scm.org/wiki/MissingRequirement for more information) + [255] + + $ cd .. + +#if zstd + + $ hg --config format.compengine=zstd init zstd + $ cd zstd + $ cat .hg/requires + compenginezstd + dotencode + fncache + generaldelta + revlogv1 + store + + $ touch foo + $ hg -q commit -A -m 'initial commit with a lot of repeated repeated repeated text' + + $ hg debugrevlog -c | grep 0x28 + 0x28 : 1 (100.00%) + 0x28 : 98 (100.00%) + + $ cd .. + +Specifying a new format.compengine on an existing repo won't introduce data +with that engine or a requirement + + $ cd default + $ touch bar + $ hg --config format.compengine=zstd -q commit -A -m 'add bar with a lot of repeated repeated repeated text' + + $ cat .hg/requires + dotencode + fncache + generaldelta + revlogv1 + store + + $ hg debugrevlog -c | grep 0x78 + 0x78 (x) : 2 (100.00%) + 0x78 (x) : 199 (100.00%) + +#endif _______________________________________________ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel