New submission from David Wilson:
There is some really funky behaviour in the zipfile module, where, depending on
whether zipfile.ZipFile() is passed a string filename or a file-like object,
one of two things happens:
a) Given a file-like object, zipfile does not (since it cannot) consume excess
file descriptors on each call to '.open()'. However, simultaneous calls to
.open() on the zip file's members (from the same thread) will produce file-like
objects for each member that appear intertwingled in some unfortunate manner:
Traceback (most recent call last):
  File "my.py", line 23, in <module>
    b()
  File "my.py", line 18, in b
    m.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 689, in readline
    return io.BufferedIOBase.readline(self, limit)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 727, in peek
    chunk = self.read(n)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 763, in read
    data = self._read1(n)
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/zipfile.py", line 839, in _read1
    data = self._decompressor.decompress(data, n)
zlib.error: Error -3 while decompressing data: invalid stored block lengths
b) Given a string filename, simultaneous use of .open() produces a new file
descriptor for each opened member, which does not result in the above error,
but triggers an even worse one: file descriptor exhaustion given a sufficiently
large zip file.
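As a rough illustration of the access pattern in case (a), the sketch below
builds a small archive in memory and reads two members alternately through one
file-like object. It is not the attached my.py: the member names, the sizes,
and the helpers build_archive() and interleaved_read() are assumptions made up
for this example. On an affected version the reads are expected to fail along
the lines of the traceback above, so the sketch reports the exception instead
of propagating it.

```python
import io
import zipfile
import zlib


def build_archive():
    """Build an in-memory zip with two compressed members (hypothetical data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("one.txt", "".join("one %d\n" % i for i in range(5000)))
        zf.writestr("two.txt", "".join("two %d\n" % i for i in range(5000)))
    buf.seek(0)
    return buf


def interleaved_read(fileobj):
    """Alternate readline() calls on two members opened at the same time.

    Returns (lines_one, lines_two, error); error is None if every read
    succeeded, otherwise the exception that stopped the loop.
    """
    lines_one, lines_two = [], []
    try:
        with zipfile.ZipFile(fileobj) as zf:
            with zf.open("one.txt") as m1, zf.open("two.txt") as m2:
                while True:
                    l1 = m1.readline()
                    l2 = m2.readline()
                    if not l1 and not l2:
                        break
                    if l1:
                        lines_one.append(l1)
                    if l2:
                        lines_two.append(l2)
    except (zlib.error, zipfile.BadZipFile) as exc:
        return lines_one, lines_two, exc
    return lines_one, lines_two, None


one, two, err = interleaved_read(build_archive())
print("error:", err, "| lines read:", len(one), len(two))
```

Whether err is set depends on the Python version in use; the sketch only
exercises the pattern and makes no claim about the outcome.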
This tripped me up rather badly last week during consulting work, and I'd like
to see both these behaviours fixed somehow. The ticket is more an RFC to see if
anyone has thoughts on how this fix should happen. It seems to me a no-brainer
that, since the ZIP file format fundamentally always requires a seekable file,
in both the "constructed using file-like object" case and the
"constructed using filename" case we should somehow reuse a single file object
to satisfy all reads of compressed member data.
It seems the problems can be fixed in both cases without damaging interface
semantics by simply tracking the expected 'current' read offset in each
ZipExtFile instance. Prior to any .read() on the shared file object, we simply
call .seek() to restore that instance's offset.
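A minimal sketch of that idea, using a hypothetical OffsetTrackingReader class
rather than the real ZipExtFile and leaving out decompression entirely: each
reader remembers its own position in the shared file and seeks there before
every read, so interleaved reads no longer disturb one another.

```python
import io


class OffsetTrackingReader:
    """Hypothetical read-only view onto a region of a shared seekable file."""

    def __init__(self, shared_file, start, length):
        self._file = shared_file
        self._pos = start            # this reader's private 'current' offset
        self._end = start + length   # first byte past this reader's region

    def read(self, n=-1):
        remaining = self._end - self._pos
        if n is None or n < 0 or n > remaining:
            n = remaining
        # Restore our own offset; another reader may have moved the file.
        self._file.seek(self._pos)
        data = self._file.read(n)
        self._pos += len(data)
        return data


# Two regions of one file object, read alternately in small chunks.
shared = io.BytesIO(b"AAAAAAAABBBBBBBB")
a = OffsetTrackingReader(shared, 0, 8)
b = OffsetTrackingReader(shared, 8, 8)

chunks_a = [a.read(4)]
chunks_b = [b.read(4)]
chunks_a.append(a.read(4))
chunks_b.append(b.read(4))

result_a = b"".join(chunks_a)
result_b = b"".join(chunks_b)
```

As with the proposal itself, nothing here is thread-safe: a lock around the
seek-and-read pair would be needed for that.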
Of course the result would not be thread-safe, but in the current code a
ZipExtFile obtained from a zipfile constructed from a file-like object is
already not thread-safe. With some additional work we could make the module
thread-safe in both cases; however, this is not the current semantic and doesn't
appear to be guaranteed by the module documentation.
---
Finally as to why you'd want to simultaneously open huge numbers of ZIP
members, well, ZIP itself easily supports streamy reads, and ZIP files can be
quite large, even larger than RAM. So it should be possible, as I needed last
week, to read streamily from a large number of members.
---
The attached my.zip is sufficient to demonstrate both problems.
The attached my.py has function a() to demonstrate the FD leak and b() to
demonstrate the intertwingled state.
----------
components: Library (Lib)
files: mymy.zip
messages: 230987
nosy: dw
priority: normal
severity: normal
status: open
title: zipfile simultaneous open broken and/or needlessly(?) consumes
unreasonable number of file descriptors
versions: Python 3.4
Added file: http://bugs.python.org/file37171/mymy.zip
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue22842>
_______________________________________