On 2020-05-11 Andrey Ponomarenko wrote:
> I need to maximize compression level of a large set of similar
> tarballs by xz. Is it possible to somehow export a common dictionary
> from a subset of tarballs and reuse it when
> compressing/decompressing others?

The .xz format doesn't currently support specifying an external
dictionary. liblzma has a preset dictionary feature which can be used
for custom file formats. The preset dictionary feature is kind of
half-done as there is no function to clone the encoder state, so
compressing many files with a big preset dictionary wastes CPU time in
re-analyzing the external dictionary for each file. There is also no
dictionary builder that would analyze multiple files and figure out a
good initial dictionary common to all files.

The existing preset dictionary code could be used to implement
external dictionary support in the .xz format but there are existing
solutions to your problem that are likely as good or better. For
example, xdelta3 or zstd (with its external dictionary feature) could
be fine.

For example, let's say there is latest.tar and multiple old*.tar
files.

With xdelta3:

    for I in old*.tar; do
        xdelta3 -9 -s latest.tar "$I" "$I.delta"
    done

With zstd:

    zstd -19 -D latest.tar old*.tar

Use "zstd --ultra -22" for better compression.

zstd has a training function to build a good external dictionary but
it's meant for compressing tiny files. With a quick try I didn't get
good results with megabyte-sized files but perhaps I don't know how to
use it correctly. Using a single .tar as a dictionary worked great
though.

In both cases you obviously need latest.tar to decompress the
old*.tar{.delta,.zst} files.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode