Re: [RFC] Using LZ4 compression by default

2017-08-23 Thread Evgeny Kotkov
Stefan Sperling  writes:

>> With all that in mind, I propose that we do (2).  Any objections?
>
> I want (2) as well. Thanks for doing this work, and for very clearly
> and transparently describing the tradeoffs and our options.

Johan Corveleyn  writes:

>> With all that in mind, I propose that we do (2).  Any objections?
>
> I want (2) as well. Thanks for doing this work, and for very clearly
> and transparently describing the tradeoffs and our options.
>
> +1

Thank you for the comments.

I committed the core change that makes LZ4 the new default for
format 8 repositories in https://svn.apache.org/r1805897


Regards,
Evgeny Kotkov


Re: [RFC] Using LZ4 compression by default

2017-08-19 Thread Johan Corveleyn
On 19 Aug 2017 at 11:40, "Stefan Sperling" wrote:

> On Fri, Aug 18, 2017 at 10:58:13PM +0300, Evgeny Kotkov wrote:
> > With all that in mind, I propose that we do (2).  Any objections?
>
> I want (2) as well. Thanks for doing this work, and for very clearly
> and transparently describing the tradeoffs and our options.


+1

-- 
Johan


Re: [RFC] Using LZ4 compression by default

2017-08-19 Thread Stefan Sperling
On Fri, Aug 18, 2017 at 10:58:13PM +0300, Evgeny Kotkov wrote:
> With all that in mind, I propose that we do (2).  Any objections?

I want (2) as well. Thanks for doing this work, and for very clearly
and transparently describing the tradeoffs and our options.


Re: [RFC] Using LZ4 compression by default

2017-08-18 Thread Jacek Materna
(2) is the best path for USERS of Subversion. Adding more toggles adds risk
and complexity. Improvements should "just work" out of the box, unless
there is some technical hurdle. A 25% increase in disk usage is nothing
today in exchange for even a fraction more speed on operations that happen
thousands of times a day on a typical team. However, this is more than a
fraction!

Great quantitative metrics, Evgeny.

On Fri, Aug 18, 2017 at 2:58 PM, Evgeny Kotkov wrote:

> Evgeny Kotkov  writes:
>
> >  (B) For the on-disk data, we start using LZ4 compression by default
> >  (in format 8 repositories).
> >
> >  The reasoning behind this is that currently, zlib compression is a
> >  hotspot that can limit the performance of both read and write
> >  operations on the repository.  It also affects how well Subversion
> >  works when dealing with large and, possibly, incompressible files
> >  (and I tend to think that it's a fairly important use case).
> >
> >  Switching to a faster compression algorithm that is also used by
> >  various other file system implementations should improve the
> >  performance of such operations in a visible way.  Please note that
> >  this change is a trade-off between the compression ratio and speed
> >  of the operations.
> >  The repositories using LZ4 compression would require a bit more disk
> >  space.  The amount of the required additional space is proportional
> >  to the difference between the compression ratio of LZ4 and zlib-5,
> >  which can be roughly estimated as around 30-35% for compressible
> >  binary and text files, although that may vary depending on the
> >  actual data.
> >
> > To illustrate how these changes will affect the speed of some of the
> > operations, the 'svn import' of a 2 GB file over HTTP on LAN in my
> > environment takes 18 seconds instead of 63 seconds.
>
> Here are some additional zlib-5 vs. LZ4 benchmarks to consider:
>
>   (All tests were performed on the SSD drive using the file:// protocol.
>    The results should be interpreted as "before is zlib-5, after is LZ4".
>    Also, the results over http:// are somewhat similar in terms of the
>    improvement factor and are omitted for brevity.  "Import time" is
>    for "svn import", "Export time" is for "svnbench null-export".)
>
>  - One compressible file, 1.17 GB:
>
>    Import time:  40.79 s  →  11.97 s   (3.4 x faster)
>    Export time:  6.30 s  →  3.13 s   (2.0 x faster)
>    Compression ratio:  31.8 %  →  43.8 %   (384 MB → 529 MB on disk)
>
>  - One incompressible file, 833 MB:
>
>    Import time:  32.16 s  →  8.22 s   (3.9 x faster)
>    Export time:  2.71 s  →  2.06 s   (1.3 x faster)
>    Compression ratio:  91.9 %  →  93.3 %   (766 MB → 778 MB on disk)
>
>  - Multiple source code files (TortoiseSVN trunk), 213 MB, ~7,000 files:
>
>    Import time:  17.83 s  →  10.36 s   (1.7 x faster)
>    Export time:  1.62 s  →  1.15 s   (1.4 x faster)
>    Compression ratio:  35.2 %  →  48.8 %   (75 MB → 104 MB on disk)
>
>  - Multiple binary files, 1.68 GB, 25 files:
>
>    Import time:  55.10 s  →  15.84 s   (3.5 x faster)
>    Export time:  8.56 s  →  4.34 s   (2.0 x faster)
>    Compression ratio:  38.4 %  →  46.9 %   (662 MB → 807 MB on disk)
>
>
> Reiterating over the whole topic of the default compression algorithm for
> the repositories, I think that we have the following options to choose
> from:
>
>  (1) Make LZ4 compression optional in format 8 repositories, and still use
>  zlib-5 compression by default.
>
> With this approach, users would have to have "compression=lz4" in
> fsfs.conf to use it.  Personally, I would expect the number of such users
> to be quite low, because they would have to both upgrade the repository
> to fsfs format 8 and use non-default fsfs.conf settings.
>
> This option means that we'd keep our existing performance characteristics
> with read and write operations being limited by the compression speed
> of zlib-5 (which isn't exactly fast) for most of the users.  It also means
> that the expected size and the compression ratio of the repository data
> would remain unchanged.
>
>  (2) Compress with LZ4 by default in all (new and upgraded) format 8
>  repositories.
>
> This approach means that a much bigger part of our users will have
> their data compressed with LZ4, and will get the visible read and write
> performance improvement.  It also means that the compression ratio of
> the on-disk data will be worse than with zlib-5, and the projected
> size of the repositories will increase accordingly.
>
> One additional point to consider here is that such a change may be
> going a bit against the policy of adding a new optional feature and
> switching the default in the next minor release.
>
>  (3) Compress with LZ4 by default, but only in new format 8 repositories.
>
> This option is similar to (2), but with a more limited scope where
>   

Re: [RFC] Using LZ4 compression by default

2017-08-18 Thread Evgeny Kotkov
Evgeny Kotkov  writes:

>  (B) For the on-disk data, we start using LZ4 compression by default
>  (in format 8 repositories).
>
>  The reasoning behind this is that currently, zlib compression is a
>  hotspot that can limit the performance of both read and write
>  operations on the repository.  It also affects how well Subversion
>  works when dealing with large and, possibly, incompressible files
>  (and I tend to think that it's a fairly important use case).
>
>  Switching to a faster compression algorithm that is also used by various
>  other file system implementations should improve the performance of
>  such operations in a visible way.  Please note that this change is a
>  trade-off between the compression ratio and speed of the operations.
>  The repositories using LZ4 compression would require a bit more disk
>  space.  The amount of the required additional space is proportional
>  to the difference between the compression ratio of LZ4 and zlib-5,
>  which can be roughly estimated as around 30-35% for compressible
>  binary and text files, although that may vary depending on the
>  actual data.
>
> To illustrate how these changes will affect the speed of some of the
> operations, the 'svn import' of a 2 GB file over HTTP on LAN in my
> environment takes 18 seconds instead of 63 seconds.

Here are some additional zlib-5 vs. LZ4 benchmarks to consider:

  (All tests were performed on the SSD drive using the file:// protocol.
   The results should be interpreted as "before is zlib-5, after is LZ4".
   Also, the results over http:// are somewhat similar in terms of the
   improvement factor and are omitted for brevity.  "Import time" is
   for "svn import", "Export time" is for "svnbench null-export".)

 - One compressible file, 1.17 GB:

   Import time:  40.79 s  →  11.97 s   (3.4 x faster)
   Export time:  6.30 s  →  3.13 s   (2.0 x faster)
   Compression ratio:  31.8 %  →  43.8 %   (384 MB → 529 MB on disk)

 - One incompressible file, 833 MB:

   Import time:  32.16 s  →  8.22 s   (3.9 x faster)
   Export time:  2.71 s  →  2.06 s   (1.3 x faster)
   Compression ratio:  91.9 %  →  93.3 %   (766 MB → 778 MB on disk)

 - Multiple source code files (TortoiseSVN trunk), 213 MB, ~7,000 files:

   Import time:  17.83 s  →  10.36 s   (1.7 x faster)
   Export time:  1.62 s  →  1.15 s   (1.4 x faster)
   Compression ratio:  35.2 %  →  48.8 %   (75 MB → 104 MB on disk)

 - Multiple binary files, 1.68 GB, 25 files:

   Import time:  55.10 s  →  15.84 s   (3.5 x faster)
   Export time:  8.56 s  →  4.34 s   (2.0 x faster)
   Compression ratio:  38.4 %  →  46.9 %   (662 MB → 807 MB on disk)
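
As a cross-check of the figures above, here is a minimal Python sketch that
recomputes the speed-up factors and the relative growth in on-disk size from
the raw numbers in the list.  The dictionary layout and the helper names
(speedup, size_increase_pct, summarize) are hypothetical and exist only for
this illustration; the input values are copied from the benchmarks.

```python
# Figures copied from the benchmark list: each pair is (zlib-5, LZ4).
BENCHMARKS = {
    "one compressible file": {
        "import_s": (40.79, 11.97), "export_s": (6.30, 3.13), "disk_mb": (384, 529)},
    "one incompressible file": {
        "import_s": (32.16, 8.22), "export_s": (2.71, 2.06), "disk_mb": (766, 778)},
    "source code files": {
        "import_s": (17.83, 10.36), "export_s": (1.62, 1.15), "disk_mb": (75, 104)},
    "binary files": {
        "import_s": (55.10, 15.84), "export_s": (8.56, 4.34), "disk_mb": (662, 807)},
}


def speedup(before_s, after_s):
    """How many times faster the operation became."""
    return before_s / after_s


def size_increase_pct(before_mb, after_mb):
    """Extra on-disk space with LZ4, as a percentage of the zlib-5 size."""
    return (after_mb - before_mb) / before_mb * 100.0


def summarize(benchmarks):
    """Return rounded speed-ups and disk growth for every benchmark."""
    rows = {}
    for name, m in benchmarks.items():
        rows[name] = {
            "import_speedup": round(speedup(*m["import_s"]), 1),
            "export_speedup": round(speedup(*m["export_s"]), 1),
            "disk_increase_pct": round(size_increase_pct(*m["disk_mb"]), 1),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(BENCHMARKS).items():
        print(f"{name}: import {row['import_speedup']}x, "
              f"export {row['export_speedup']}x, "
              f"disk +{row['disk_increase_pct']} %")
```

The disk growth is measured relative to the zlib-5 size, so it is a
different quantity from the "Compression ratio" rows, which relate the
compressed size to the original data size.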


Reiterating over the whole topic of the default compression algorithm for
the repositories, I think that we have the following options to choose from:

 (1) Make LZ4 compression optional in format 8 repositories, and still use
 zlib-5 compression by default.

With this approach, users would have to have "compression=lz4" in
fsfs.conf to use it.  Personally, I would expect the number of such users
to be quite low, because they would have to both upgrade the repository
to fsfs format 8 and use non-default fsfs.conf settings.

This option means that we'd keep our existing performance characteristics
with read and write operations being limited by the compression speed
of zlib-5 (which isn't exactly fast) for most of the users.  It also means
that the expected size and the compression ratio of the repository data
would remain unchanged.
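
To make the opt-in step of option (1) concrete, here is a minimal Python
sketch that reads a "compression" setting from an fsfs.conf-style file.
Only the "compression=lz4" setting comes from the text above.  The section
name "[deltification]", the "zlib-5" default label, and the helper name
read_compression are assumptions made for this illustration; this is not
how Subversion itself parses the file.

```python
import configparser

# Hypothetical fsfs.conf excerpt.  The section name is an assumption;
# the "compression = lz4" setting is the one mentioned above.
SAMPLE_FSFS_CONF = """
[deltification]
compression = lz4
"""


def read_compression(conf_text, default="zlib-5"):
    """Return the configured compression, or the default if it is not set."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return parser.get("deltification", "compression", fallback=default)


if __name__ == "__main__":
    print(read_compression(SAMPLE_FSFS_CONF))  # opted in to LZ4
    print(read_compression(""))                # untouched config: default
```

Under option (1), a repository whose fsfs.conf was never edited would take
the second path and keep using the default.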

 (2) Compress with LZ4 by default in all (new and upgraded) format 8
 repositories.

This approach means that a much bigger part of our users will have
their data compressed with LZ4, and will get the visible read and write
performance improvement.  It also means that the compression ratio of
the on-disk data will be worse than with zlib-5, and the projected
size of the repositories will increase accordingly.

One additional point to consider here is that such a change may be
going a bit against the policy of adding a new optional feature and
switching the default in the next minor release.

 (3) Compress with LZ4 by default, but only in new format 8 repositories.

This option is similar to (2), but with a more limited scope where
LZ4 compression is only used for the new repositories created with
Subversion 1.10 binaries.


Personally, I find the significant speed improvement for both read and write
operations from LZ4 compression quite important, and I think that the actual
reduction in the compression ratio is acceptable, considering the gained
benefits.  I also think that the risks associated with switching the default
on-disk format are low in this particular case, considering that the LZ4
library is stable.  (It has been available for a long time and is used by
projects like the Linux kernel and ZFS.)

In other words, I think that we would benefit 

Re: [RFC] Using LZ4 compression by default

2017-08-04 Thread Evgeny Kotkov
Branko Čibej  writes:

> Of course latency, for practical purposes, tells you how many gateways
> there are between the client and the server, not what the effective
> bandwidth is.

Agreed.

Overall, my intention here was to improve what I think is a reasonably
common case with the server located "in the same building" and with the
repository containing a lot of large and, possibly, incompressible files
(assets, documents, etc.), without affecting other cases.  Currently, in
such a scenario with the default configuration, both the HTTP client and the
server are doing _a lot of_ unnecessary compression work, and that visibly
slows things down.

The assumption here is that low-latency connections should most likely have
enough bandwidth to cover the difference between compression ratios of
LZ4 and zlib-5, and allow us to use the much faster compression algorithm.

(Not too sure if it's even possible to determine the effective bandwidth
 beforehand, considering things like TCP auto-scaling.)

To avoid potential regressions, the current approach will always fall back
to zlib-5.  While there might be cases where it could result in a suboptimal
decision, e.g., for fat networks with medium/high latency (and that's not
so obvious, as the traffic can have a cost), I think that, probably, it
should work well in practice for the case described above and avoid
regressions in other cases.


Regards,
Evgeny Kotkov


Re: [RFC] Using LZ4 compression by default

2017-08-04 Thread Jacek Materna
Based on what we are seeing, disk storage is at least 10x less of a concern
than UX (performance and speed of developer workflows) across a huge swath
of enterprise SVN users, and it is becoming less of a concern each quarter.
Unless the disk storage increase impacts the server side performance in
aggregate, I don't understand why the priority (biased defaults) would not
be to increase the speed of every operation.

On Fri, Aug 4, 2017 at 9:30 AM, Nathan Hartman wrote:

> On Aug 2, 2017, at 2:59 PM, Evgeny Kotkov wrote:
> > With the recently added support for LZ4 compression (r1801940 et al),
> > we now have an option of using it by default for the on-disk data and
> > over the wire.
> > [...]
> > The amount of the required additional space is proportional
> > to the difference between the compression ratio of LZ4 and zlib-5,
> > which can be roughly estimated as around 30-35% for compressible
> > binary and text files, although that may vary depending on the
> > actual data.
> >
> > To illustrate how these changes will affect the speed of some of the
> > operations, the 'svn import' of a 2 GB file over HTTP on LAN in my
> > environment takes 18 seconds instead of 63 seconds.
>
> Regarding on disk storage, for small repos a 30% size increase is probably
> not material, but it may be significant for large repos. Is it feasible to
> get the best of both worlds by using LZ4 for fast commits and then
> recompress using zlib in svnadmin pack?




-- 

Jacek Materna
Chief Technology Officer

Assembla
+1 210 410 7661


Re: [RFC] Using LZ4 compression by default

2017-08-04 Thread Nathan Hartman
On Aug 2, 2017, at 2:59 PM, Evgeny Kotkov  wrote:
> With the recently added support for LZ4 compression (r1801940 et al),
> we now have an option of using it by default for the on-disk data and
> over the wire.
> [...]
> The amount of the required additional space is proportional
> to the difference between the compression ratio of LZ4 and zlib-5,
> which can be roughly estimated as around 30-35% for compressible
> binary and text files, although that may vary depending on the
> actual data.
> 
> To illustrate how these changes will affect the speed of some of the
> operations, the 'svn import' of a 2 GB file over HTTP on LAN in my
> environment takes 18 seconds instead of 63 seconds.

Regarding on disk storage, for small repos a 30% size increase is probably not 
material, but it may be significant for large repos. Is it feasible to get the 
best of both worlds by using LZ4 for fast commits and then recompress using 
zlib in svnadmin pack?

Re: [RFC] Using LZ4 compression by default

2017-08-04 Thread Evgeny Kotkov
Paul Hammant  writes:

> Wouldn't the svn client just speculatively specify a HTTP "Accepts" header
> with requests up to the server?  You'd be able to do back/forwards
> compatibility with that, and not have to change any other wire spec ?

I think that this is close to how things currently work, and what makes
the compatibility possible.  However, it's still up to the client to decide
what kind of data (LZ4-compressed, zlib-compressed, or uncompressed) it
prefers, so the client advertises this preference via the ;q= quality
parameter.

Perhaps, for historical reasons, we don't use separate "Accept" and
"Accept-Encoding" headers.

In other words, currently, instead of something like this

  Accept-Encoding: gzip
  Accept: application/vnd.svn-svndiff;format=2;q=0.9,
  application/vnd.svn-svndiff;format=1;q=0.8, ...

a client sends

  Accept-Encoding: gzip,svndiff2;q=0.9,svndiff1;q=0.8, ...

("gzip" is there as the server might be configured to send uncompressed
 deltas, but gzip the whole response.)
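
The preference ordering carried by the ;q= parameter can be sketched in a
few lines of Python.  This is a simplified illustration of generic HTTP
quality-value handling, not the actual mod_dav_svn logic; the function
names and the set of "server supported" formats are hypothetical.

```python
def parse_accept_encoding(header):
    """Parse an Accept-Encoding value into (name, q) pairs, best first.

    Entries without an explicit q parameter get q=1.0; ties keep the
    order in which they appear in the header.
    """
    prefs = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0
        for param in params.split(";"):
            key, _, value = param.strip().partition("=")
            if key == "q" and value:
                q = float(value)
        prefs.append((name.strip(), q, position))
    prefs.sort(key=lambda p: (-p[1], p[2]))
    return [(name, q) for name, q, _ in prefs]


def pick_delta_format(header, server_supported):
    """Return the client's most preferred format that the server also knows."""
    for name, q in parse_accept_encoding(header):
        if q > 0 and name in server_supported:
            return name
    return None


if __name__ == "__main__":
    header = "gzip,svndiff2;q=0.9,svndiff1;q=0.8"
    print(parse_accept_encoding(header))
    print(pick_delta_format(header, {"svndiff1", "svndiff2"}))
    print(pick_delta_format(header, {"svndiff1"}))
```

With the header from the example above, a server that knows both svndiff
formats would pick svndiff2, while an older server that only knows svndiff1
would still find a format it can send.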

Another thing to consider is that Accept-Encoding headers only handle the
part where the server sends data to the client.  In order for the client
to properly prepare the request body that the server can interpret, there's
an additional layer of capability negotiation that happens during the first
OPTIONS request.


Regards,
Evgeny Kotkov


Re: [RFC] Using LZ4 compression by default

2017-08-04 Thread Branko Čibej
On 04.08.2017 15:11, Evgeny Kotkov wrote:
> Branko Čibej  writes:
>> To make matters more interesting, when I'm working remotely, I can
>> access the office SVN server either remotely through an HTTPS proxy, or
>> "locally" by using our VPN ... then my IP address is on the same subnet
>> as the server's, but it's still working on a WAN.
> There is a known limitation: serf_connection_get_latency() currently
> doesn't determine the latency for proxied connections.  In this case,
> this wouldn't matter, as the proxy is used for remote work, where we
> would want to keep the slower zlib algorithm.  However, if the access
> to the local server happens through a proxy, we currently won't be able
> to use the faster compression algorithm.  (Perhaps we could improve on
> this in the future.)

It turns out that I misled you a bit ... in the "WAN" case it's not a
proxy, it's a simple DNAT/port-forward to the SVN server. So the latency
calculation should be effectively correct.

Of course latency, for practical purposes, tells you how many gateways
there are between the client and the server, not what the effective
bandwidth is.



-- Brane



Re: [RFC] Using LZ4 compression by default

2017-08-04 Thread Paul Hammant
Wouldn't the svn client just speculatively specify a HTTP "Accepts" header
with requests up to the server?  You'd be able to do back/forwards
compatibility with that, and not have to change any other wire spec?


Re: [RFC] Using LZ4 compression by default

2017-08-04 Thread Evgeny Kotkov
Branko Čibej  writes:

>>  - Using LZ4 over the wire requires both endpoints to advertise that they
>>    know how to deal with the new svndiff2 format that allows LZ4 compression.
>
> "Allows" or "requires"? I expect that anything in svndiff2 that's
> compressed uses LZ4?

In this sense, "requires": svndiff2 chunks that don't bloat from the
compression are always compressed with LZ4.

What I was trying to say here when I wrote that, I think, is that among
all svndiff formats, svndiff2 is the only one that "allows" using LZ4
compression.

> How do you detect that the network is "local"?

By comparing the latency against a predefined threshold (currently, 5 ms).

> I'm not too happy with the idea of another client-side knob for this.
> For example, I usually use my SVN client over the WAN but sometimes
> bring the laptop to the office ... so I'd "have" to either keep changing
> my client configuration (not viable) or configure for the worst case
> (which defeats the purpose of having LZ4 as an option). A server-side
> setting doesn't help, either, since the server has no idea where the
> client is accessing it from.

The "http-compression" client-side knob is not new; I think that we have had
it for some time now.

The difference is that previously, the allowed values were "yes(default)/no",
and the compression has always been enabled for the majority of the users.
Now, it is "auto(default)/yes/no", where the new 'auto' value behaves as
'yes', but switches to the faster LZ4 compression for local networks.
The worst-case behavior doesn't change: the client behaves just as it did
previously, using the slower compression with a better ratio.

Unless I am missing something, the new 'auto' default should work reasonably
well, or at least, result in an improvement in the case you have described:

 - When working over WAN, the client would use slower compression with
   a better ratio, as it has been doing previously.

 - Once the laptop is brought to the office, the client sees the decreased
   latency and switches to the faster LZ4 compression.
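
The decision described above can be sketched roughly as follows.  This is
an illustrative Python model, not the actual client code: the function and
parameter names are hypothetical, the strict "less than" comparison against
the 5 ms threshold is an assumption, and an unknown latency (for example,
behind a proxy) is modeled as None.

```python
LOCAL_LATENCY_THRESHOLD_MS = 5.0  # threshold mentioned above


def choose_wire_compression(http_compression, latency_ms, peer_supports_svndiff2):
    """Pick the wire compression for a given http-compression setting.

    http_compression       -- "auto", "yes" or "no"
    latency_ms             -- measured latency, or None if it is unknown
    peer_supports_svndiff2 -- whether the other end advertised svndiff2
    """
    if http_compression == "no":
        return "none"
    if http_compression == "yes":
        return "zlib"
    if http_compression != "auto":
        raise ValueError(f"unknown http-compression value: {http_compression!r}")
    # 'auto' behaves as 'yes', but switches to LZ4 on local networks.
    is_local = latency_ms is not None and latency_ms < LOCAL_LATENCY_THRESHOLD_MS
    if is_local and peer_supports_svndiff2:
        return "lz4"
    return "zlib"  # worst case: same behavior as before


if __name__ == "__main__":
    print(choose_wire_compression("auto", 0.4, True))   # laptop in the office
    print(choose_wire_compression("auto", 45.0, True))  # working over WAN
    print(choose_wire_compression("auto", None, True))  # latency unknown
```

In this model, every case where the client cannot establish that the
network is local ends up on the zlib path, which matches the previous
behavior.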

> To make matters more interesting, when I'm working remotely, I can
> access the office SVN server either remotely through an HTTPS proxy, or
> "locally" by using our VPN ... then my IP address is on the same subnet
> as the server's, but it's still working on a WAN.

There is a known limitation: serf_connection_get_latency() currently
doesn't determine the latency for proxied connections.  In this case,
this wouldn't matter, as the proxy is used for remote work, where we
would want to keep the slower zlib algorithm.  However, if the access
to the local server happens through a proxy, we currently won't be able
to use the faster compression algorithm.  (Perhaps we could improve on
this in the future.)


Thanks,
Evgeny Kotkov


Re: [RFC] Using LZ4 compression by default

2017-08-04 Thread Branko Čibej
On 02.08.2017 20:59, Evgeny Kotkov wrote:

[...]

>  - Using LZ4 over the wire requires both endpoints to advertise that they
>    know how to deal with the new svndiff2 format that allows LZ4 compression.

"Allows" or "requires"? I expect that anything in svndiff2 that's
compressed uses LZ4?


[...]


> I propose the following approach.  Please note that for the wire format
> part, it only considers the http:// protocol, but we can optionally adjust
> svn:// later:
>
>  (A) For the HTTP wire format, we start using LZ4 compression by default,
>  but only over local networks.

How do you detect that the network is "local"?

>  The reasoning behind this is that we probably wouldn't want to start
>  always using LZ4 compression, as that would result in a regression over
>  WAN, where the better compression ratio is usually preferable to the
>  compression performance.  Another point is that even for local networks
>  we cannot disable compression altogether, because for slow 10 or even
>  100 Mbps LANs, where the throughput is limited by the slow network,
>  using fast compression can be better than no compression.  This is
>  where LZ4 comes to the rescue by offering reasonable compression
>  ratio and fast compression speed.
>
>  This approach is currently implemented with the http-compression=auto
>  client-side configuration option (r1803899), which is the new default.
>  While the HTTP client is generally in charge of the used compression
>  algorithm, there's also a way to override its preference on the server.
>  If the mod_dav_svn's SVNCompressionLevel directive is set to 1, a
>  server would then override the client's preference and still send
>  svndiff2 / LZ4 data if the client can accept it.

I'm not too happy with the idea of another client-side knob for this.
For example, I usually use my SVN client over the WAN but sometimes
bring the laptop to the office ... so I'd "have" to either keep changing
my client configuration (not viable) or configure for the worst case
(which defeats the purpose of having LZ4 as an option). A server-side
setting doesn't help, either, since the server has no idea where the
client is accessing it from.

To make matters more interesting, when I'm working remotely, I can
access the office SVN server either remotely through an HTTPS proxy, or
"locally" by using our VPN ... then my IP address is on the same subnet
as the server's, but it's still working on a WAN.


-- Brane