New submission from Ruben Vorderman <r.h.p.vorder...@lumc.nl>:

While working on python-isal, which aims to provide faster drop-in replacements 
for the zlib and gzip modules, I found that gzip.compress and gzip.decompress 
are implemented suboptimally, which hurts performance.

gzip.compress and gzip.decompress both do the following:
- Instantiate a BytesIO object to mimic a file.
- Instantiate a GzipFile object to compress or read the file.

That means far more Python code is involved than strictly necessary. Also, the 
data is already fully in memory, yet it is streamed anyway. That is quite a 
waste.

I propose the following:
- The documentation should make it clear that zlib.decompress(..., wbits=31) 
and zlib.compress(..., wbits=31) (once bpo-43612 has been addressed) are both 
quicker, but come with caveats: zlib.compress cannot set the mtime field, and 
zlib.decompress does not take multi-member gzip files into account.
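To illustrate that caveat, here is a minimal sketch, using only the public 
zlib and gzip APIs, of the faster one-shot path and where it falls short:

```python
import gzip
import zlib

data = b"example payload" * 1024

# wbits=31 tells zlib to expect a gzip wrapper, so a full gzip member
# can be decompressed in one call, without any BytesIO/GzipFile overhead:
assert zlib.decompress(gzip.compress(data), wbits=31) == data

# Caveat: only the FIRST member of a multi-member gzip file is handled;
# gzip.decompress concatenates the output of all members.
two_members = gzip.compress(b"aaa") + gzip.compress(b"bbb")
assert gzip.decompress(two_members) == b"aaabbb"          # all members
assert zlib.decompress(two_members, wbits=31) == b"aaa"   # first member only
```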
- For gzip.compress: the GzipFile._write_gzip_header method should be moved to 
a module-level _gzip_header function that returns a bytes object. 
GzipFile._write_gzip_header can call this function, and gzip.compress can call 
it as well to create a header. gzip.compress then calls zlib.compress(data, 
wbits=-15) (after bpo-43612 has been fixed) to create a raw deflate block. A 
gzip trailer can easily be created by packing zlib.crc32(data) and len(data) & 
0xffffffff into a struct. For an example implementation, see: 
https://github.com/pycompression/python-isal/blob/v0.8.0/src/isal/igzip.py#L242
- For gzip.decompress it becomes more involved. A read_gzip_header function 
can be created, but the current implementation raises EOFError when the header 
is incomplete due to a truncated file, instead of BadGzipFile. This makes it 
harder to implement something that is not a major break from the current 
gzip.decompress. Apart from the header, the implementation is straightforward: 
run a while-True loop and perform all operations inside it. Validate the 
header and record where it ends. Create a zlib.decompressobj(wbits=-15). 
Decompress all the data from the end of the header onwards and flush. Extract 
the crc and length from the first 8 bytes of the unused data, then set 
data = decompobj.unused_data[8:] and break if not data. For a reference 
implementation, see: 
https://github.com/pycompression/python-isal/blob/v0.8.0/src/isal/igzip.py#L300
Note that the decompress loop itself is quite straightforward; checking the 
header while maintaining backwards compatibility with gzip.decompress is not 
so simple.

And that brings me to another point: should non-descriptive EOFErrors be 
raised when reading the gzip header, or should informative BadGzipFile errors 
be thrown while the header is parsed? I tend towards the latter. For example, 
BadGzipFile("Truncated header") instead of a bare EOFError, or at least 
EOFError("Truncated gzip header"). I am aware that this confounds this issue 
with another one, but the two are coupled in the implementation, so both need 
to be solved at the same time.

Given the headaches that gzip.decompress causes, it might be easier to fix 
gzip.compress in a first PR and address gzip.decompress later.

----------
messages: 389438
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: gzip.compress and gzip.decompress are sub-optimally implemented.

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43613>
_______________________________________