Daniel Jewell <[email protected]> added the comment:
In playing with Lib/zipfile.py and Lib/zipimport.py, I noticed that zipfile has
supported opportunistic loading of bz2/lzma for ~9 years. However, zipimport
assumes only zlib will be used. (Yet zipfile.PyZipFile will happily create
zlib/bz2/lzma ZIP archives, e.g. zipfile.PyZipFile('mod.zip', 'w',
compression=zipfile.ZIP_LZMA).)
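For example, something like this (the mod.py/mod.zip names are just
placeholders) produces an LZMA-compressed archive that zipfile reads back
fine but that, per the above, zipimport would not be able to decompress:

```python
# Sketch: PyZipFile happily writes an LZMA-compressed archive of a module,
# even though zipimport (historically) only handles stored/zlib entries.
import os
import tempfile
import zipfile

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "mod.py")
with open(src, "w") as f:
    f.write("X = 42\n")

archive = os.path.join(tmpdir, "mod.zip")
with zipfile.PyZipFile(archive, "w", compression=zipfile.ZIP_LZMA) as zf:
    zf.writepy(src)  # compiles mod.py and stores the resulting mod.pyc

with zipfile.ZipFile(archive) as zf:
    info = zf.infolist()[0]
    print(info.filename, info.compress_type == zipfile.ZIP_LZMA)
```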
At first I wondered why zipimport essentially duplicates so much of zipfile,
but then realized (after reading some of the commit messages around the
pure-Python rewrite of zipimport a few years ago) that since zipimport runs
as part of interpreter startup, it needs to avoid importing certain modules.
I'm wondering whether this specific zipimport issue is an indicator of a
larger one?
Specifically:
* The duplication of code between zipfile and zipimport seems like a potential
source of bugs. I get the rationale, but perhaps the "base" ZIP functionality
ought to be refactored out of both zipimport and zipfile so they can share it,
and I mean the low-level stuff (compressor, checksum, etc.). zipfile
definitely imports more than zipimport does, but I haven't looked extensively
at what those imports are doing.
Ultimately, the behavior of the new pure-Python zipimport appears to be
essentially the same as that of zipimport.c:
Per PEP-302 [https://www.python.org/dev/peps/pep-0302/], zipimport.zipimporter
gets registered into sys.path_hooks. When you import anything in a zip file,
all of the paths get cached into sys.path_importer_cache as
zipimport.zipimporter objects.
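You can see the hook in action with a throwaway archive (demomod is just a
placeholder module name):

```python
# Sketch: after putting a ZIP archive on sys.path, the path hook installed
# by zipimport turns it into a zipimporter, cached in sys.path_importer_cache.
import importlib
import os
import sys
import tempfile
import zipfile
import zipimport

tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "demo.zip")
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("demomod.py", "VALUE = 1\n")

sys.path.insert(0, archive)
mod = importlib.import_module("demomod")

# The path entry for the archive is now cached as a zipimporter object.
importer = sys.path_importer_cache[archive]
print(type(importer), isinstance(importer, zipimport.zipimporter))
```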
The zipimporter objects, when instantiated, run zipimport._read_directory(),
which returns a low-level dict: each key is a filename (module) and each
value is a tuple of low-level metadata about that file, including the byte
offset into the zip file, last-modified time, CRC, etc. (see zipimport.py:330
or so). This is then stored in zipimporter._files.
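For illustration, you can instantiate a zipimporter directly and peek at
that cache (note _files is a private attribute of the pure-Python
implementation, not a stable API, and metamod is a placeholder name):

```python
# Sketch: instantiating a zipimporter reads only the archive's central
# directory; _files maps entry names to tuples of per-file metadata
# (path, compression flag, sizes, byte offset, time, date, CRC).
import os
import tempfile
import zipfile
import zipimport

tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "meta.zip")
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("metamod.py", "Y = 2\n")

imp = zipimport.zipimporter(archive)
entry = imp._files["metamod.py"]
print(entry)  # low-level metadata tuple; nothing has been decompressed yet
```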
Critically, the contents of the zip file are not decompressed at this stage:
only the metadata about what is in the zip file and (most importantly) where
it is gets stored in memory. Only when a module is actually loaded is the
data read, using the cached metadata. There appears to be no provision for
(a) verifying that the zip file itself hasn't changed or (b) refreshing the
metadata. So it's no surprise, really, that this error is happening: the
cached metadata instructs zipimporter to decompress a specific byte offset in
the zip file *when an import happens*. If the zip file changes on disk
between the metadata scan (i.e. the first read of the zip file) and the
actual loading, bam: error.
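A sketch of that failure mode (the module names stale_a/stale_b and the
padding entry are made up; the padding just pushes stale_b.py to a large
offset so the cached offset points past EOF once the archive is rewritten):

```python
# Sketch: the first import caches the whole directory's metadata; rewriting
# the archive afterwards leaves the cached byte offsets pointing at garbage.
import os
import sys
import tempfile
import zipfile

tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "stale.zip")

# First version of the archive: stale_b.py sits ~20 KB in.
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("stale_a.py", "A = 1\n")
    zf.writestr("pad.txt", "x" * 20000)
    zf.writestr("stale_b.py", "B = 2\n")

sys.path.insert(0, archive)
import stale_a  # caches metadata for *every* entry in the archive

# Rewrite the archive on disk: stale_b.py now lives near offset 0,
# and the file is far smaller than the cached offset.
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("stale_b.py", "B = 2\n")

failure = None
try:
    import stale_b  # uses the stale byte offset -> bam
except Exception as exc:
    failure = type(exc).__name__
print("second import failed with:", failure)
```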
There appear to be several ways to fix this; I'm not sure which is best:
* Possibly lock the ZIP file on first import so it doesn't change (this
presents many new issues)
* Rescan the ZIP before each import; but the whole point of caching the
contents appears to be avoiding this
* Hash the entire file and compare (CPU-expensive)
* Rely on the modification time? e.g. cache the ZIP's mtime at first read,
and if it later differs, invalidate the cache and rescan
* Cache the entire zip file into memory at first load. This has some
advantages (the data can be stored compressed; the import becomes
all-or-nothing; possibly faster?), but then there would need to be some kind
of limit on the size/total size, or it becomes a memory hog...
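For the mtime option, a rough sketch of what the invalidation could look
like (MtimeCachedZip and fresh_files are hypothetical names, not zipimport
API):

```python
# Sketch of mtime-based invalidation: remember the archive's mtime alongside
# the cached directory listing, and rescan only when the mtime changes.
import os
import tempfile
import zipfile

class MtimeCachedZip:
    def __init__(self, path):
        self.path = path
        self._mtime = None   # mtime at last scan
        self._names = None   # cached directory listing

    def fresh_files(self):
        mtime = os.stat(self.path).st_mtime
        if mtime != self._mtime:  # first read, or archive changed on disk
            with zipfile.ZipFile(self.path) as zf:
                self._names = zf.namelist()
            self._mtime = mtime
        return self._names

# Hypothetical usage:
tmp = tempfile.mkdtemp()
path = os.path.join(tmp, "demo.zip")
with zipfile.ZipFile(path, "w") as zf:
    zf.writestr("one.txt", "1")
cache = MtimeCachedZip(path)
print(cache.fresh_files())
```

(For reference, I believe later Python versions grew an explicit
zipimporter.invalidate_caches() method, which is a manual version of the
same idea.)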
----------
nosy: +danieljewell
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue19081>
_______________________________________