Evgeny Kotkov <evgeny.kot...@visualsvn.com> writes:

> (B) For the on-disk data, we start using LZ4 compression by default
> (in format 8 repositories).
>
> The reasoning behind this is that currently, zlib compression is a
> hotspot that can limit the performance of both read and write
> operations on the repository. It also affects how well Subversion
> works when dealing with large and, possibly, incompressible files
> (and I tend to think that it's a fairly important use case).
>
> Switching to a faster compression algorithm that is also used by other
> various file system implementations should improve the performance of
> such operations in a visible way. Please note that this change is a
> trade-off between the compression ratio and speed of the operations.
> The repositories using LZ4 compression would require a bit more disk
> space. The amount of the required additional space is proportional
> to the difference between the compression ratio of LZ4 and zlib-5,
> which can be roughly estimated as around 30-35% for compressible
> binary and text files, although that may vary depending on the
> actual data.
>
> To illustrate how these changes will affect the speed of some of the
> operations, the 'svn import' of a 2 GB file over HTTP on LAN in my
> environment takes 18 seconds instead of 63 seconds.

Here are some additional zlib-5 vs. LZ4 benchmarks to consider:

(All tests were performed on an SSD using the file:// protocol.  The
results should be read as "before is zlib-5, after is LZ4".  The results
over http:// are similar in terms of the improvement factor and are
omitted for brevity.  "Import time" is for "svn import", "Export time"
is for "svnbench null-export".)

 - One compressible file, 1.17 GB:

     Import time:        40.79 s → 11.97 s  (3.4x faster)
     Export time:         6.30 s →  3.13 s  (2.0x faster)
     Compression ratio:  31.8% → 43.8%  (384 MB → 529 MB on disk)

 - One incompressible file, 833 MB:

     Import time:        32.16 s →  8.22 s  (3.9x faster)
     Export time:         2.71 s →  2.06 s  (1.3x faster)
     Compression ratio:  91.9% → 93.3%  (766 MB → 778 MB on disk)

 - Multiple source code files (TortoiseSVN trunk), 213 MB, ~7,000 files:

     Import time:        17.83 s → 10.36 s  (1.7x faster)
     Export time:         1.62 s →  1.15 s  (1.4x faster)
     Compression ratio:  35.2% → 48.8%  (75 MB → 104 MB on disk)

 - Multiple binary files, 1.68 GB, 25 files:

     Import time:        55.10 s → 15.84 s  (3.5x faster)
     Export time:         8.56 s →  4.34 s  (2.0x faster)
     Compression ratio:  38.4% → 46.9%  (662 MB → 807 MB on disk)

Revisiting the whole topic of the default compression algorithm for
repositories, I think that we have the following options to choose from:

 (1) Make LZ4 compression optional in format 8 repositories, and still
     use zlib-5 compression by default.

     With this approach, users would have to set "compression=lz4" in
     fsfs.conf to use it.  Personally, I would expect the number of such
     users to be quite low, because they would have to both upgrade the
     repository to fsfs format 8 and use non-default fsfs.conf settings.

     This option means that, for most users, we'd keep our existing
     performance characteristics, with read and write operations being
     limited by the compression speed of zlib-5 (which isn't exactly
     fast).
     It also means that the expected size and the compression ratio of
     the repository data would remain unchanged.

 (2) Compress with LZ4 by default in all (new and upgraded) format 8
     repositories.

     This approach means that a much bigger part of our users will have
     their data compressed with LZ4, and will get the visible read and
     write performance improvement.  It also means that the on-disk data
     will compress less well than with zlib-5, and the projected size of
     the repositories will increase accordingly.

     One additional point to consider here is that such a change may go
     somewhat against the policy of first adding a new feature as
     optional and only switching the default in the next minor release.

 (3) Compress with LZ4 by default, but only in new format 8
     repositories.

     This option is similar to (2), but with a more limited scope: LZ4
     compression is only used for new repositories created with
     Subversion 1.10 binaries.

Personally, I find the significant speed improvement for both read and
write operations from LZ4 compression quite important, and I think that
the loss in compression ratio is acceptable, considering the benefits
gained.

I also think that the risks associated with switching the default
on-disk format are low in this particular case, considering that the
LZ4 library is stable.  (It has been available for a long time and is
used by projects like the Linux kernel and ZFS.)

In other words, I think that we would benefit from using LZ4 compression
by default.

Of options (2) and (3), both of which make LZ4 the new default
compression algorithm, I think that option (2) is better.  The reasoning
here is that using LZ4 compression would improve the performance even
for existing repositories, by making new commits faster and by speeding
up read operations for newly committed files.  Apart from this, option
(3) needs implementation and is probably going to come with a couple of
related challenges, which can otherwise be avoided.

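For reference, the improvement factors and the on-disk growth behind the
benchmark figures can be re-derived from the raw timings and sizes.  The
following is only a minimal Python sketch: the names (BENCHMARKS, speedup,
growth_percent, summarize) are made up for illustration and are not part
of Subversion, and the numbers are copied from the benchmark list in this
message.

```python
# Raw figures copied from the zlib-5 vs. LZ4 benchmark list in this message.
# Each entry: (label,
#              (import seconds before, after),
#              (export seconds before, after),
#              (on-disk MB before, after))
# "before" is zlib-5, "after" is LZ4.
BENCHMARKS = [
    ("one compressible file", (40.79, 11.97), (6.30, 3.13), (384, 529)),
    ("one incompressible file", (32.16, 8.22), (2.71, 2.06), (766, 778)),
    ("source code files", (17.83, 10.36), (1.62, 1.15), (75, 104)),
    ("binary files", (55.10, 15.84), (8.56, 4.34), (662, 807)),
]


def speedup(before, after):
    """Return how many times faster the 'after' run is than 'before'."""
    return before / after


def growth_percent(before, after):
    """Return the relative increase in on-disk size, in percent."""
    return (after - before) / before * 100.0


def summarize(benchmarks):
    """Return (label, import factor, export factor, disk growth %) rows,
    each value rounded to one decimal place."""
    rows = []
    for label, imp, exp, disk in benchmarks:
        rows.append((
            label,
            round(speedup(*imp), 1),
            round(speedup(*exp), 1),
            round(growth_percent(*disk), 1),
        ))
    return rows


if __name__ == "__main__":
    for label, imp_x, exp_x, grow in summarize(BENCHMARKS):
        print(f"{label}: import {imp_x}x, export {exp_x}x, disk +{grow}%")
```

On these figures, the on-disk growth for the compressible file and for
the source code tree works out to roughly 38%, and to about 22% for the
binary files, so the actual overhead clearly depends on the data.
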
With all that in mind, I propose that we do (2).  Any objections?


Thanks,
Evgeny Kotkov