New submission from David Wilson:

There is some really funky behaviour in the zipfile module, where, depending on 
whether zipfile.ZipFile() is passed a string filename or a file-like object, 
one of two things happens:

a) Given a file-like object, zipfile does not (since it cannot) consume excess 
file descriptors on each call to '.open()'; however, opening several of the 
zip file's members simultaneously (even from the same thread) produces 
file-like objects whose reads appear intertwingled in some unfortunate manner:

Traceback (most recent call last):
  File "my.py", line 23, in <module>
    b()
  File "my.py", line 18, in b
    m.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 689, in readline
    return io.BufferedIOBase.readline(self, limit)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 727, in peek
    chunk = self.read(n)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 763, in read
    data = self._read1(n)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 839, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid stored block lengths
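
For concreteness, a rough sketch of the kind of code that triggers this 
(it assumes a my.zip containing at least two deflate-compressed members, 
and approximates the attached my.py rather than reproducing it verbatim):

    import zipfile

    with open('my.zip', 'rb') as fp:
        zf = zipfile.ZipFile(fp)   # constructed from a file-like object
        names = zf.namelist()
        m1 = zf.open(names[0])     # both ZipExtFiles share the same fp
        m2 = zf.open(names[1])
        m1.readline()              # advances fp's real position
        m2.readline()              # decompresses from the wrong offset -> zlib.error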



b) Given a string filename, simultaneous use of .open() produces a new file 
descriptor for each opened member. This does not result in the above error, 
but triggers an even worse one: file descriptor exhaustion, given a zip file 
with sufficiently many members.
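
Again as a rough sketch (this assumes my.zip contains more members than 
the process fd limit, commonly 256-1024):

    import zipfile

    zf = zipfile.ZipFile('my.zip')  # constructed from a string filename
    members = []
    for name in zf.namelist():
        # each .open() re-opens the archive by name internally, consuming
        # a fresh descriptor per member
        members.append(zf.open(name))  # eventually OSError: too many open files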


This tripped me up rather badly last week during consulting work, and I'd like 
to see both these behaviours fixed somehow. The ticket is more an RFC to see if 
anyone has thoughts on how this fix should happen; it seems to me a no-brainer 
that, since the ZIP file format fundamentally requires a seekable file, in both 
the "constructed using file-like object" case and the "constructed using 
filename" case we should somehow reuse the sole file object passed to us to 
satisfy all reads of compressed member data.

It seems the problems can be fixed in both cases without damaging interface 
semantics by simply tracking the expected 'current' read offset in each 
ZipExtFile instance: before performing any .read(), we call .seek() on the 
shared file object to restore that member's position.
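
A minimal sketch of the idea (class and attribute names here are 
illustrative, not actual zipfile internals):

    class _OffsetTrackingReader:
        def __init__(self, fileobj, start):
            self._file = fileobj  # the sole file object backing the ZipFile
            self._pos = start     # this member's expected 'current' offset

        def read(self, n):
            self._file.seek(self._pos)     # restore our position first
            data = self._file.read(n)
            self._pos = self._file.tell()  # remember where we stopped
            return data

Any number of such readers could then interleave reads over one shared 
file object, at the cost of an extra seek per read.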

Of course the result would not be thread-safe, but at least in the current 
code, ZipExtFile in the "constructed from a file-like object" case is already 
not thread-safe. With some additional work we could make the module 
thread-safe in both cases; however, that is not the current semantic, and it 
doesn't appear to be guaranteed by the module documentation.
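
If thread safety were wanted, the obvious extension is a lock shared by 
all readers of a given file object, making each seek+read pair atomic; 
hypothetical, building on the sketch above:

    import threading

    class _LockedReader(_OffsetTrackingReader):
        def __init__(self, fileobj, start, lock):
            super().__init__(fileobj, start)
            self._lock = lock  # one lock per underlying file object

        def read(self, n):
            with self._lock:  # the seek+read pair must not interleave
                return super().read(n)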

---

Finally, as to why you'd want to simultaneously open huge numbers of ZIP 
members: ZIP itself easily supports streamy reads, and ZIP files can be 
quite large, even larger than RAM. So it should be possible, as I needed last 
week, to read streamily from a large number of members.

---

The attached my.zip is sufficient to demonstrate both problems.

The attached my.py has function a() to demonstrate the FD leak and b() to 
demonstrate the intertwingled state.

----------
components: Library (Lib)
files: mymy.zip
messages: 230987
nosy: dw
priority: normal
severity: normal
status: open
title: zipfile simultaneous open broken and/or needlessly(?) consumes unreasonable number of file descriptors
versions: Python 3.4
Added file: http://bugs.python.org/file37171/mymy.zip

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue22842>
_______________________________________