A few of us started an off-list conversation with Yann Collett (author of lz4 and zstd and general compression guru) about deltas, compression, and zstd. Most of that conversation is suitable for the public domain since it revolves around the open source Mercurial project. So this thread will be the continuation of that discussion.
My initial message (with minor modifications) to Yann follows. --- My immediate goals for Mercurial are to allow zstd to be used as a drop-in replacement for zlib wherever zlib is used, without changing /too/ much of the architecture. For streaming compression (producing standalone "bundle" files and transferring data over the wire protocol), zstd has been nothing short of amazing. https://www.mercurial-scm.org/pipermail/mercurial-devel/ 2016-December/091797.html shows ~60% CPU usage while yielding better compression ratios. You can't ask for much better than that! The other major use of zlib in Mercurial is revlogs. You can read more about them at https://www.mercurial-scm.org/repo/hg/help/internals.revlogs. In short, revlogs are our storage structure for "blob" data. Each logical entity (the set of commits (changelog), the set of files in a commit (manifest), and each file/path tracked in history (filelog)) has its own revlog (read: data within a specific revlog tends to look very similar). Revlogs store data in delta chains. The first entry is stored as fulltext. Then we compute and store deltas/diffs for subsequent entries. Rebuilding the fulltext for a revision involves finding a base fulltext then walking the delta chain to apply deltas until you rebuild the fulltext. We attempt to zlib compress all deltas today, storing the zlib compressed result if it is smaller or plaintext if it isn't. In https://www.mercurial-scm.org/pipermail/mercurial-devel/ 2017-January/091928.html, I'm attempting to replace zlib in revlogs with zstd. It does "just work." But it isn't without its caveats. Many of them are listed in https://www.mercurial-scm.org/pipermail/mercurial-devel/ 2017-January/091982.html. Here are some of the areas where I could use your expertise: * Optimizing read performance. We'd really like changelog reads (changeset/commit data) to be as fast as possible. We've already removed delta chains there. I've also noticed that decompression with dictionaries is substantially faster than without them. Is there further tuning of compression parameters(read: not just integer comp level) to optimize for read speed? (Note: changelog entries tend to be 100-300 bytes raw, with some going into the tens of kilobytes or larger range.) * Everything dictionary compression. See https://www.mercurial-scm.org/ pipermail/mercurial-devel/2017-January/091986.html. It seems like we could get some major compression ratio and speed wins with dictionaries. But that requires complications to build, store, and possibly transfer dictionaries. At the very least, I'd like your thoughts on whether Mercurial should ship pre-built dictionaries for specific use cases so people can get *some* dictionary compression benefits today without us having to implement complex dictionary management. If so, how could you provide guidance on training? * Could we emit/format revlog deltas differently or in a special way to optimize for zstd? * When should we compress versus store uncompressed? Right now, our heuristic for storing a compressed chunk in a revlog is to store compressed if it is smaller than the raw input. This feels sub-optimal to me because if your compressed data is only 1 byte smaller than raw, it feels better to just sacrifice that 1 byte in the name of performance. While I'm certainly capable of measuring this and making this a configurable parameter, I was wondering if you could provide insights for a good heuristic for determining a) should we compress string X b) should we store the compressed result. * Frame header overhead. In some cases, the longer zstd frame header is yielding larger storage sizes than zlib. This is mostly for very small inputs (<150 bytes). We're storing the content size in the frame header (so we can do one-shot decompression without over heap allocating). I don't want to reinvent the wheel, but it is certainly tempting to construct our own frame header for special cases (namely very small inputs). Those are shorter term questions. Down the road, we're thinking about more radical changes to the storage model. We don't want 1 file/revlog per logical tracked entity (due to inode scaling issues among other things). We want to put multiple logical entities in the same container. There's a ton of possibilities for radical compression and encoding techniques here.
_______________________________________________ Mercurial-devel mailing list Mercurial-devel@mercurial-scm.org https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel