New submission from Ruben Vorderman <r.h.p.vorder...@lumc.nl>:

The gzip file format is ubiquitous, and so is zlib, its first (?) free/libre 
implementation, together with the gzip command line tool. Both use the DEFLATE 
algorithm.

Lately some faster algorithms (most notably zstd) have popped up which offer 
better speed and compression ratios than zlib. Unfortunately, switching over to 
zstd will not be seamless: it is not compatible with zlib/gzip in any way.

Luckily some developers have implemented DEFLATE in a faster way. Most notable 
are libdeflate (https://github.com/ebiggers/libdeflate) and Intel's Intelligent 
Storage Acceleration Library, isa-l (https://github.com/intel/isa-l).

These libraries provide the libdeflate-gzip and igzip utilities respectively. 
Both can compress and decompress the same gzip files: an igzip-compressed file 
can be read with gzip, and vice versa.
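This interoperability can also be checked from Python itself: any valid gzip 
member, no matter which DEFLATE implementation produced it, decompresses with 
the stdlib gzip/zlib modules. A minimal sketch (toy payload, not the benchmark 
file):

```python
import gzip
import zlib

# Toy FASTQ-like payload standing in for the real benchmark file.
data = b"@read1\nACGT\n+\nFFFF\n" * 1000

# Compress with zlib using the gzip container (wbits=31 selects a gzip header).
co = zlib.compressobj(level=1, wbits=31)
gz = co.compress(data) + co.flush()

# A gzip member written via zlib is readable by the gzip module...
assert gzip.decompress(gz) == data
# ...and a gzip-module-written member is readable via zlib, too.
assert zlib.decompress(gzip.compress(data, compresslevel=1), wbits=31) == data
```

Any isa-l- or libdeflate-produced file sits behind the same container format, 
which is what makes a drop-in faster backend possible.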

To give an idea of the speed improvements that can be obtained, here are some 
benchmarks. All benchmarks were done using hyperfine 
(https://github.com/sharkdp/hyperfine). The system was a Ryzen 5 3600 with 
2x16GB DDR4-3200 memory, running Debian 10. All benchmarks were performed on a 
tmpfs, which lives in memory, to prevent IO bottlenecks. The test file was a 
5-million-read FASTQ file of 1.6 GB 
(https://en.wikipedia.org/wiki/FASTQ_format). Files of this type are common in 
bioinformatics at 100+ GB sizes, so they make a good real-world benchmark.

I benchmarked pigz on one thread as well, since it uses zlib but in a faster 
way than gzip. zstd was benchmarked for comparison.

Versions: 
gzip 1.9 (provided by debian)
pigz 2.4 (provided by debian)
igzip 2.25.0 (provided by debian)
libdeflate-gzip 1.6 (compiled by conda-build with the recipe here: 
https://github.com/conda-forge/libdeflate-feedstock/pull/4)
zstd 1.3.8 (provided by debian)

Level 1 was chosen for all compression benchmarks. Times are averages over 10 
runs.

COMPRESSION
program            time           size   memory
gzip               23.5 seconds   657M   1.5M
pigz (one thread)  22.2 seconds   658M   2.4M
libdeflate-gzip    10.1 seconds   623M   1.6G (reads entire file in memory)
igzip              4.6 seconds    620M   3.5M
zstd (to .zst)     6.1 seconds    584M   12.1M

For decompression, all programs decompressed the file created using gzip -1 
(even zstd, which can also decompress gzip).

DECOMPRESSION
program            time           memory
gzip               10.5 seconds   744K
pigz (one-thread)  6.7 seconds    1.2M
libdeflate-gzip    3.6 seconds    2.2G (reads in mem before writing)
igzip              3.3 seconds    3.6M
zstd (from .gz)    6.4 seconds    2.2M
zstd (from .zst)   2.3 seconds    3.1M

As the benchmarks above show, using Intel's Storage Acceleration Library may 
improve performance quite substantially, offering very fast compression and 
decompression. This puts igzip in the zstd ballpark in terms of speed while 
still offering backwards compatibility with gzip.
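For context, the user-facing API that such a backend would accelerate is the 
existing stdlib gzip path; no code changes would be needed on the caller's 
side. A sketch of the typical bioinformatics pattern (a hypothetical small 
payload stands in for a real FASTQ file):

```python
import gzip
import os
import tempfile

# Hypothetical small payload; real files are in the 100+ GB range.
payload = b"@read1\nACGT\n+\nFFFF\n" * 1000

# Stream-write and stream-read a gzipped file via the stdlib. With an
# isa-l-backed DEFLATE, this exact code could run several times faster
# without any API change.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fastq.gz")
    with gzip.open(path, "wb", compresslevel=1) as fh:
        fh.write(payload)
    with gzip.open(path, "rb") as fh:
        assert fh.read() == payload
```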

Intel's Intelligent Storage Acceleration Library (isa-l) comes with a 
BSD-3-Clause license, so there should be no licensing issues with using that 
code inside CPython.

----------
components: Library (Lib)
messages: 375533
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Include much faster DEFLATE implementations in Python's gzip and zlib 
libraries. (isa-l)
versions: Python 3.10

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue41566>
_______________________________________