Hi,

I'd be glad to help you with the auto-detection, as I wrote that code (a long 
time ago). As I said, it's not a "perfect" solution, and you might want to 
tweak it for best results.

I ran a small test against an "out-of-the-box" repository and found that 99% of 
the binaries are in a jcr:data property and have a mime type available. This 
might not be the case for all repositories. The mix of mime types probably 
varies even more; in my case, over 90% came from just 6 mime types 
(application/zip, application/java-archive, image/png, application/javascript, 
image/jpeg, text/css).

> IMO we should still allow tweaking between best performance and best 
> compression

Yes, that makes sense!

A global switch "compress everything regardless" sounds easy.

A more complex solution would be a configurable list of mime types to _never_ 
compress, probably application/zip, application/java-archive, image/png, 
image/jpeg, video/mp4 or so, plus a threshold for the rest that decides at 
which point to compress (an extreme value means compress everything else).
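
Roughly, the decision could look like this (just a sketch; the class and method 
names below are made up, not existing FileVault API, and the defaults would of 
course be configurable):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// combine a configurable "never compress" mime type list with a
// compressibility threshold for everything else
public class CompressionDecision {

    // example defaults; the real list would come from configuration
    private static final Set<String> NEVER_COMPRESS = new HashSet<>(Arrays.asList(
            "application/zip", "application/java-archive",
            "image/png", "image/jpeg", "video/mp4"));

    /**
     * @param mimeType        the jcr:mimeType of the binary (may be null)
     * @param compressibility heuristic score in [0..1], higher means more compressible
     * @param threshold       configurable; 0 means "compress everything else"
     */
    static boolean shouldCompress(String mimeType, double compressibility, double threshold) {
        if (mimeType != null && NEVER_COMPRESS.contains(mimeType)) {
            return false;
        }
        return compressibility >= threshold;
    }
}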

Regards,
Thomas


From: <maret.timot...@gmail.com> on behalf of Timothée Maret 
<timothee.ma...@gmail.com>
Reply-To: "dev@jackrabbit.apache.org" <dev@jackrabbit.apache.org>
Date: Tuesday, 7 March 2017 at 14:28
To: "dev@jackrabbit.apache.org" <dev@jackrabbit.apache.org>
Subject: Re: [FileVault][discuss] performance improvement proposal

Hi Thomas,

2017-03-07 11:27 GMT+01:00 Thomas Mueller <muel...@adobe.com>:
Hi,

> As for configuration: What is the reason for having a configuration option ?

Detecting if data is compressible can be done with low overhead, without having 
to look at the content type, and without having to use configuration options:

http://stackoverflow.com/questions/7027022/how-to-efficiently-predict-if-data-is-compressible

Sample code is available in one of the answers ("I implemented a few methods to 
test if data is compressible…"). It is quite simple, and only needs to process 
256 bytes. Both the "Partial Entropy" and the "Simplified Compression" 
approaches work relatively well.

This is not designed to be a "perfect" solution to the problem. It's a 
low-overhead heuristic that will reduce the compression overhead on average.
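
A minimal sketch of the "Partial Entropy" variant (this is not the exact code 
from the linked answer, and the cut-off below is just a guess):

// sample the first 256 bytes, compute the entropy of the upper 4 bits of each
// byte, and treat the data as incompressible when the sample looks random
public class CompressibilityCheck {

    static boolean looksCompressible(byte[] data) {
        int len = Math.min(data.length, 256);
        if (len == 0) {
            return true;
        }
        // histogram over the 16 possible values of the high nibble
        int[] counts = new int[16];
        for (int i = 0; i < len; i++) {
            counts[(data[i] >> 4) & 0x0f]++;
        }
        double entropy = 0; // in bits, at most 4 for 16 buckets
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / len;
                entropy -= p * (Math.log(p) / Math.log(2));
            }
        }
        // close to 4 bits means the sample looks random, i.e. likely already
        // compressed; the cut-off is tunable
        return entropy < 3.8;
    }
}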

This sounds very nice :-) We could indeed drop the configurable list of MIME 
types.

IMO we should still allow tweaking between best performance and best 
compression though, in order to accommodate different use cases.
I thought about covering both aspects in JCRVLT-163, but have now changed the 
focus of JCRVLT-163 to avoiding compressing binaries (with or without 
auto-detection) and created JCRVLT-164 to allow tweaking the default 
compression level.


Regards,

Timothee


Regards,
Thomas




On 06.03.2017 at 16:43, Timothée Maret <timothee.ma...@gmail.com> wrote:

Hi,

With Sling content distribution (using FileVault), we observe a significantly 
lower throughput for content packages containing binaries.
The main bottleneck seems to be the compression algorithm applied to every 
element contained in the content package.

I think that we could improve the throughput significantly, simply by not 
re-compressing binaries that are already compressed.
In order to figure out which binaries are already compressed, we could match 
the content type stored along with the binary against a configurable list of 
content types.
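
As a rough illustration of the idea (this is not the actual patch in [0]; the 
class and method names are made up), the ZIP entry compression level could be 
switched per entry based on the stored content type:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class PackageEntryWriter {

    // example list of content types that are typically already compressed
    private static final Set<String> ALREADY_COMPRESSED = new HashSet<>(Arrays.asList(
            "application/zip", "application/java-archive",
            "image/png", "image/jpeg", "video/mp4"));

    static void writeEntry(ZipOutputStream zip, String name, String mimeType, byte[] data)
            throws IOException {
        if (mimeType != null && ALREADY_COMPRESSED.contains(mimeType)) {
            // entry stays DEFLATED, but almost no CPU is spent on it
            zip.setLevel(Deflater.NO_COMPRESSION);
        } else {
            zip.setLevel(Deflater.DEFAULT_COMPRESSION);
        }
        zip.putNextEntry(new ZipEntry(name));
        zip.write(data);
        zip.closeEntry();
    }
}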

I have done some micro tests with this idea (patch in [0]). I think that the 
results are promising.

Exporting a single 250 MB JPEG is 80% faster (22.4 sec -> 4.3 sec) for a 3% 
bigger content package (233.2 MB -> 240.4 MB)
Exporting AEM OOTB /content/dam is 50% faster (11.9 sec -> 5.9 sec) for a 5% 
bigger content package (92.8 MB -> 97.4 MB)
Import for the same cases is 66% and 32% faster, respectively.

I think this could either be done by default, allowing the list of types that 
skip compression to be configured.
Alternatively, it could be done at the project level, by extending FileVault 
with the following:

1. For each package, allow defining the default compression level (best 
compression, best speed [1])
2. Expose an API that allows plugging in custom logic to decide how to compress 
a given artefact (a rough sketch follows below)
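
A purely hypothetical shape of such an extension point (not existing FileVault 
API) could be:

// callback that decides how strongly to compress each artefact
public interface CompressionPolicy {

    /**
     * @param path     repository path of the artefact
     * @param mimeType content type stored with the binary, may be null
     * @return a java.util.zip.Deflater level, e.g. Deflater.BEST_SPEED [1],
     *         Deflater.BEST_COMPRESSION or Deflater.NO_COMPRESSION
     */
    int compressionLevel(String path, String mimeType);
}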

In any case, the changes would be backward compatible. Content packages created 
with the new code would be installable on instances running the old code and 
vice versa.

wdyt ?

Regards,

Timothee


[0] 
https://github.com/tmaret/jackrabbit-filevault/tree/performance-avoid-compressing-already-compressed-binaries-based-on-content-type-detection
[1] 
https://docs.oracle.com/javase/7/docs/api/java/util/zip/Deflater.html#BEST_SPEED





--
Timothée Maret
