(2) is the best path for USERS of Subversion. More toggles add risk and
complexity. Improvements should "just work" out of the box unless
there is some technical hurdle. A 25% increase in disk usage is nothing
today for even a fraction more speed on operations that happen thousands of
times a day on a typical team. And this is much more than a fraction!

Great quantitative metrics, Evgeny.
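
For anyone who wants to recompute the trade-off from the figures quoted
below, here is a minimal sketch in Python. The numbers are copied from
Evgeny's benchmark list; the function names and the "combined" total over
the four data sets are my own illustration, not something Subversion
provides, and four samples are not a representative repository.

```python
# Figures copied from the zlib-5 -> LZ4 benchmarks quoted in this thread.
# Each row: (label, import_before_s, import_after_s,
#            export_before_s, export_after_s, mb_before, mb_after)
BENCHMARKS = [
    ("one compressible file",   40.79, 11.97, 6.30, 3.13, 384, 529),
    ("one incompressible file", 32.16,  8.22, 2.71, 2.06, 766, 778),
    ("source code files",       17.83, 10.36, 1.62, 1.15,  75, 104),
    ("binary files",            55.10, 15.84, 8.56, 4.34, 662, 807),
]


def speedup(before_s, after_s):
    """How many times faster the operation became."""
    return before_s / after_s


def growth_percent(mb_before, mb_after):
    """Extra on-disk space relative to the zlib-5 size, in percent."""
    return (mb_after - mb_before) / mb_before * 100.0


def combined_growth_percent(rows):
    """Growth over the summed sizes of all rows (illustrative only)."""
    total_before = sum(row[5] for row in rows)
    total_after = sum(row[6] for row in rows)
    return growth_percent(total_before, total_after)


if __name__ == "__main__":
    for label, ib, ia, eb, ea, mb, ma in BENCHMARKS:
        print(f"{label}: import {speedup(ib, ia):.1f}x, "
              f"export {speedup(eb, ea):.1f}x, "
              f"disk +{growth_percent(mb, ma):.1f}%")
    print(f"combined disk growth: "
          f"+{combined_growth_percent(BENCHMARKS):.1f}%")
```

On these four data sets the extra disk space ranges from under 2% for the
incompressible file to just under 39% for the source tree, against import
speedups of 1.7x to 3.9x.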

On Fri, Aug 18, 2017 at 2:58 PM, Evgeny Kotkov <evgeny.kot...@visualsvn.com>
wrote:

> Evgeny Kotkov <evgeny.kot...@visualsvn.com> writes:
>
> >  (B) For the on-disk data, we start using LZ4 compression by default
> >      (in format 8 repositories).
> >
> >      The reasoning behind this is that currently, zlib compression is a
> >      hotspot that can limit the performance of both read and write
> >      operations on the repository.  It also affects how well Subversion
> >      works when dealing with large and, possibly, incompressible files
> >      (and I tend to think that it's a fairly important use case).
> >
> >      Switching to a faster compression algorithm that is also used by
> >      various other file system implementations should improve the
> >      performance of such operations in a visible way.  Please note that
> >      this change is a trade-off between the compression ratio and speed
> >      of the operations.
> >      The repositories using LZ4 compression would require a bit more disk
> >      space.  The amount of the required additional space is proportional
> >      to the difference between the compression ratio of LZ4 and zlib-5,
> >      which can be roughly estimated as around 30-35% for compressible
> >      binary and text files, although that may vary depending on the
> >      actual data.
> >
> > To illustrate how these changes will affect the speed of some of the
> > operations, the 'svn import' of a 2 GB file over HTTP on LAN in my
> > environment takes 18 seconds instead of 63 seconds.
>
> Here are some additional zlib-5 vs. LZ4 benchmarks to consider:
>
>   (All tests were performed on the SSD drive using the file:// protocol.
>    The results should be interpreted as "before is zlib-5, after is LZ4".
>    Also, the results over http:// are somewhat similar in terms of the
>    improvement factor and are omitted for brevity.  "Import time" is
>    for "svn import", "Export time" is for "svnbench null-export".)
>
>  - One compressible file, 1.17 GB:
>
>    Import time:  40.79 s  →  11.97 s   (3.4 x faster)
>    Export time:  6.30 s  →  3.13 s   (2.0 x faster)
>    Compression ratio:  31.8 %  →  43.8 %   (384 MB → 529 MB on disk)
>
>  - One incompressible file, 833 MB:
>
>    Import time:  32.16 s  →  8.22 s   (3.9 x faster)
>    Export time:  2.71 s  →  2.06 s   (1.3 x faster)
>    Compression ratio:  91.9 %  →  93.3 %   (766 MB → 778 MB on disk)
>
>  - Multiple source code files (TortoiseSVN trunk), 213 MB, ~7,000 files:
>
>    Import time:  17.83 s  →  10.36 s   (1.7 x faster)
>    Export time:  1.62 s  →  1.15 s   (1.4 x faster)
>    Compression ratio:  35.2 %  →  48.8 %   (75 MB → 104 MB on disk)
>
>  - Multiple binary files, 1.68 GB, 25 files:
>
>    Import time:  55.10 s  →  15.84 s   (3.5 x faster)
>    Export time:  8.56 s  →  4.34 s   (2.0 x faster)
>    Compression ratio:  38.4 %  →  46.9 %   (662 MB → 807 MB on disk)
>
>
> Reiterating over the whole topic of the default compression algorithm for
> the repositories, I think that we have the following options to choose
> from:
>
>  (1) Make LZ4 compression optional in format 8 repositories, and still use
>      zlib-5 compression by default.
>
>     With this approach, users would have to have "compression=lz4" in
>     fsfs.conf to use it.  Personally, I would expect the number of such
>     users to be quite low, because they would have to both upgrade the
>     repository to fsfs format 8 and use non-default fsfs.conf settings.
>
>     This option means that we'd keep our existing performance
>     characteristics, with read and write operations being limited by the
>     compression speed of zlib-5 (which isn't exactly fast) for most of the
>     users.  It also means that the expected size and the compression ratio
>     of the repository data would remain unchanged.
>
>  (2) Compress with LZ4 by default in all (new and upgraded) format 8
>      repositories.
>
>     This approach means that a much bigger part of our users will have
>     their data compressed with LZ4, and will get the visible read and write
>     performance improvement.  It also means that the compression ratio of
>     the on-disk data will be lower than with zlib-5, and the projected
>     size of the repositories will increase accordingly.
>
>     One additional point to consider here is that such a change may be
>     going a bit against the policy of adding a new optional feature and
>     switching the default in the next minor release.
>
>  (3) Compress with LZ4 by default, but only in new format 8 repositories.
>
>     This option is similar to (2), but with a more limited scope where
>     LZ4 compression is only used for the new repositories created with
>     Subversion 1.10 binaries.
>
>
> Personally, I find the significant speed improvement for both read and
> write operations from LZ4 compression quite important, and I think that
> the actual reduction in the compression ratio is acceptable, considering
> the gained benefits.  I also think that the risks associated with
> switching the default on-disk format are low in this particular case,
> considering that the LZ4 library is stable.  (It has been available for a
> long time and is used by projects like the Linux kernel and ZFS.)
>
> In other words, I think that we would benefit from using LZ4 compression
> by default.
>
> Among the options (2) and (3) that make LZ4 the new default compression
> algorithm, I think that option (2) is better.  The reasoning here is that
> using LZ4 compression would improve the performance even for existing
> repositories by making new commits faster and by speeding up read
> operations for the newly committed files.  Apart from this, option (3)
> needs implementation and is probably going to have a couple of related
> challenges, which can be otherwise avoided.
>
> With all that in mind, I propose that we do (2).  Any objections?
>
>
> Thanks,
> Evgeny Kotkov
>



-- 

Jacek Materna
Chief Technology Officer

Assembla
+1 210 410 7661
