On 2020-05-11 Andrey Ponomarenko wrote:
> I need to maximize compression level of a large set of similar
> tarballs by xz. Is it possible to somehow export a common dictionary
> from a subset of tarballs and reuse it when
> compressing/decompressing others?

The .xz format doesn't currently support specifying an external
dictionary.

liblzma has a preset dictionary feature which can be used for custom
file formats. The preset dictionary feature is kind of half-done as
there is no function to clone the encoder state, so compressing many
files with a big preset dictionary wastes CPU time in re-analyzing the
external dictionary for each file. There is no dictionary builder that
would analyze multiple files and figure out a good initial dictionary
common to all files.

The existing preset dictionary code could be used to implement external
dictionary support in the .xz format but there are existing solutions
to your problem that are likely as good or better. For example, xdelta3
or zstd (with its external dictionary feature) could be fine.

For example, let's say there is latest.tar and multiple old*.tar files.
With xdelta3:

    for I in old*.tar; do
        xdelta3 -9 -s latest.tar "$I" "$I.delta"
    done

With zstd:

    zstd -19 -D latest.tar old*.tar

Use "zstd --ultra -22" for better compression. zstd has a training
function to build a good external dictionary but it's meant for
compressing tiny files. With a quick try I didn't get good results with
megabyte-sized files but perhaps I don't know how to use it
correctly. Using a single .tar as a dictionary worked great though.

In both cases you obviously need latest.tar to decompress the
old*.tar{.delta,.zst} files.
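
To restore the originals, something like the following should work. It
is only a sketch: it assumes latest.tar is in the current directory and
that the outputs have the names produced by the commands above. The
function names restore_xdelta3 and restore_zstd are made up for this
example.

```shell
# Sketch only: restore old*.tar from the files created above.
# Assumes latest.tar is in the current directory. The function
# names are hypothetical, not part of xdelta3 or zstd.

restore_xdelta3() {
    for D in old*.tar.delta; do
        [ -e "$D" ] || continue
        # -d decodes; -s must name the same source file that was
        # used when the delta was created.
        xdelta3 -d -s latest.tar "$D" "${D%.delta}"
    done
}

restore_zstd() {
    for Z in old*.tar.zst; do
        [ -e "$Z" ] || continue
        # -D must point to the same dictionary file that was used
        # when compressing.
        zstd -d -D latest.tar "$Z" -o "${Z%.zst}"
    done
}
```

Run restore_xdelta3 or restore_zstd in the directory that holds the
compressed files. Neither tool overwrites an existing old*.tar unless
-f is added.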

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode
