Martin Panter added the comment:

Ben: Adding the cap means that when the lower-level zlib module is asked to 
decode, say, 5 GiB, it will decode at most 4 GiB (rather than raising an 
exception). The 4 GiB limit (UINT_MAX, to be precise) comes from the 
underlying zlib library; it is not Python's doing, and it was present before 
3.5 as well.
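
To illustrate the caller-visible effect, here is a rough sketch (the 4 GiB 
figure is UINT_MAX on common platforms; the payload is just a stand-in):

    import zlib

    payload = zlib.compress(b"spam" * 1000)   # stand-in compressed data
    decomp = zlib.decompressobj()
    # With the cap, a max_length above UINT_MAX is no longer rejected with
    # OverflowError; it is clamped internally, so a single call returns at
    # most 4 GiB and leaves leftover compressed input in decomp.unconsumed_tail.
    out = decomp.decompress(payload, 5 * 1024 ** 3)   # ask for 5 GiB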

In 3.5, if you ask GzipFile.read() to decompress 5 GiB, it actually feeds the 
decompressor 8 KiB (io.DEFAULT_BUFFER_SIZE) of compressed data at a time, 
which decompresses to at most about 8 MiB (the maximum compression ratio is 
1:1032). Then it feeds the next 8 KiB chunk. So it does not matter whether the 
limit is 4 or 5 GiB, because only about 8 MiB of data can come out of each 
internal decompress() call.
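
Roughly, the loop behaves like this sketch (simplified and illustrative only; 
the real 3.5 code lives in the gzip and _compression modules and is 
structured differently):

    import io
    import zlib

    def bounded_gzip_read(fileobj, size):
        # Sketch of the 3.5-style loop: feed small compressed chunks, never
        # more than io.DEFAULT_BUFFER_SIZE (8 KiB) at a time.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # expect gzip framing
        pieces = []
        remaining = size
        while remaining > 0:
            compressed = decomp.unconsumed_tail or fileobj.read(io.DEFAULT_BUFFER_SIZE)
            if not compressed:
                break
            # Each call can emit at most ~8 MiB (8 KiB * 1032) no matter how
            # big `remaining` is; passing `remaining` as max_length here is
            # where the OverflowError appeared once it exceeded UINT_MAX.
            data = decomp.decompress(compressed, remaining)
            pieces.append(data)
            remaining -= len(data)
        return b"".join(pieces)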

Before 3.5, it first feeds a 1 KiB chunk, then grows the chunk size 
exponentially on each internal iteration, up to a cap of 10 MiB.
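
For comparison, the pre-3.5 chunk-size schedule was roughly the following 
(assuming the chunk size doubles on each iteration, as the old gzip.py loop 
did; illustrative only):

    readsize = 1024                     # start with a 1 KiB compressed chunk
    max_read_chunk = 10 * 1024 * 1024   # never read more than 10 MiB at once
    schedule = []
    for _ in range(16):
        schedule.append(readsize)
        readsize = min(max_read_chunk, readsize * 2)
    print(schedule)   # 1 KiB, 2 KiB, 4 KiB, ... capped at 10 MiB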

If you think the new 3.5 implementation (with the OverflowError fixed) is 
significantly less efficient, I can look at ways of improving it. But I am a 
bit skeptical. Usually I find that once you process tens or hundreds of kB of 
data at a time, further increases in buffer size have a negligible effect, 
because so little time is spent in interpreted Python code, and times 
actually get worse as the buffers get too big, probably due to less effective 
use of the fast memory caches.

===

I propose this patch for the 3.5 branch; it implements the internal capping 
in the zlib module. It changes the zlib module to accept sizes greater than 
UINT_MAX, but limits them internally. I think this is the most practical way 
forward, because it does not require Python code to know the value of 
UINT_MAX (which could in theory vary by platform).
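
In Python terms the effect is something like this hypothetical helper 
(decompress_upto is not part of the patch; the actual change is in the C code 
of Modules/zlibmodule.c):

    import zlib

    def decompress_upto(decomp, data, size):
        # Python code can simply request `size` bytes, even when size exceeds
        # UINT_MAX, because the zlib module now clamps each underlying call
        # itself; any input held back shows up in decomp.unconsumed_tail.
        output = decomp.decompress(data, size)
        while len(output) < size and decomp.unconsumed_tail:
            output += decomp.decompress(decomp.unconsumed_tail, size - len(output))
        return output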

It would be good to get a review of my patch. In particular, is this change 
okay for the 3.5 maintenance branch? It would also be good to confirm that my 
patch fixes the original problem, i.e. that read(2**32) works on computers 
with more than 4 GiB of memory (which I cannot test).
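
The scenario to confirm is essentially this (it needs more than 4 GiB of 
memory and a sufficiently large gzip file; "big.gz" is just a placeholder 
name):

    import gzip

    # Before the fix this raised OverflowError once the 2**32 length reached
    # the zlib module, whose length arguments are C unsigned ints; with the
    # patch it should simply return up to 2**32 bytes.
    with gzip.open("big.gz", "rb") as f:
        data = f.read(2**32)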

This regression was introduced by Issue 23529, where I made sure that 
GzipFile.read(size) does not buffer huge amounts of data from a decompression 
bomb.

----------
keywords: +patch
stage:  -> patch review
Added file: http://bugs.python.org/file41053/zlib-size_t.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25626>
_______________________________________