Daniel Jewell <danieljew...@gmail.com> added the comment:

While playing with Lib/zipfile.py and Lib/zipimport.py, I noticed that zipfile has 
supported opportunistic loading of the bz2/lzma modules for roughly nine years. 
However, zipimport assumes zlib is the only compressor in use. (Yet 
zipfile.PyZipFile will happily create zlib/bz2/lzma ZIP archives, e.g. 
zipfile.PyZipFile('mod.zip', 'w', compression=zipfile.ZIP_LZMA).)
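
A quick sketch of the mismatch (file and module names are just for illustration; 
in my testing the failure surfaces as a raw zlib error rather than a clean 
"unsupported compression" message):

    import sys, zipfile

    # Write a trivial module and pack it with LZMA -- zipfile is happy to.
    with open('mod.py', 'w') as f:
        f.write('VALUE = 42\n')
    with zipfile.PyZipFile('mod.zip', 'w', compression=zipfile.ZIP_LZMA) as zf:
        zf.writepy('mod.py')

    # zipimport assumes DEFLATE and hands the raw data to zlib:
    sys.path.insert(0, 'mod.zip')
    import mod   # blows up trying to zlib-decompress an LZMA payload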

At first I wondered why zipimport duplicates so much of zipfile, but then realized 
(after reading some of the commit messages around the pure-Python rewrite of 
zipimport a few years ago) that because zipimport runs during interpreter startup, 
it has to avoid importing certain modules.

I'm wondering whether this specific zipimport bug is an indicator of a larger 
issue.

Specifically:

* The duplication of code between zipfile and zipimport seems like a potential 
source of bugs. I understand the rationale, but perhaps the low-level "base" ZIP 
functionality (compression, checksums, directory parsing, etc.) ought to be 
refactored out of both modules so they can share it. zipfile certainly imports 
more than zipimport does, but I haven't looked closely at what those extra 
imports pull in.

Ultimately, the behavior of the new pure-Python zipimport appears to be 
essentially the same as the old zipimport.c:

Per PEP 302 [https://www.python.org/dev/peps/pep-0302/], zipimport.zipimporter is 
registered in sys.path_hooks. When you import anything from a zip file, the 
archive paths are cached in sys.path_importer_cache as zipimport.zipimporter 
instances.
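
Both halves of that are visible from the interpreter; here 'pkg.zip' and 
'somemod' are placeholders for an ordinary DEFLATE-compressed archive and a 
module inside it:

    import sys, zipimport

    # zipimporter is installed as a path hook at interpreter startup:
    print(any(hook is zipimport.zipimporter for hook in sys.path_hooks))

    # Importing anything from a zip on sys.path caches the archive path
    # as a zipimporter instance:
    sys.path.insert(0, 'pkg.zip')
    import somemod
    print(sys.path_importer_cache['pkg.zip'])  # <zipimporter object "pkg.zip">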

When instantiated, a zipimporter runs zipimport._read_directory(), which returns 
a dict mapping each archive filename (module) to a tuple of low-level metadata 
about that file: the byte offset into the zip file, last-modified time, CRC, and 
so on (see zipimport.py:330 or thereabouts). This is then stored in 
zipimporter._files.
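
For instance (._files is an implementation detail of the pure-Python rewrite, so 
this is for poking around only):

    import zipimport

    importer = zipimport.zipimporter('pkg.zip')
    for name, entry in importer._files.items():
        # Each entry is roughly (path, compress, data_size, file_size,
        # file_offset, time, date, crc), per _read_directory().
        print(name, entry)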

Critically, the contents of the zip file are not decompressed at this stage. Only 
the metadata describing what is in the archive and (most importantly) where it is 
gets stored in memory; the data itself is read, using the cached metadata, only 
when a module is actually imported. There appears to be no provision for (a) 
verifying that the zip file itself hasn't changed or (b) refreshing the metadata. 
So it's really no surprise that this error happens: the cached metadata tells 
zipimporter to decompress data at a specific byte offset in the zip file *at 
import time*. If the zip file changes on disk between the initial metadata scan 
(i.e. the first read of the zip file) and the actual load, bam: error.
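
The race is easy to reproduce by hand (names are illustrative; the exact 
exception depends on what ends up at the stale offset -- I'd expect a "bad local 
file header" ZipImportError here):

    import sys, zipfile

    # Build an archive with two modules and import one; this caches the
    # directory metadata (including byte offsets) in the zipimporter.
    with zipfile.ZipFile('pkg2.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr('a.py', 'X = 1\n')
        zf.writestr('b.py', 'Y = 2\n')
    sys.path.insert(0, 'pkg2.zip')
    import a

    # Rewrite the archive; b.py now lives at a different byte offset,
    # but the cached zipimporter still holds the old one.
    with zipfile.ZipFile('pkg2.zip', 'w', zipfile.ZIP_DEFLATED) as zf:
        zf.writestr('b.py', 'Y = 2\n')

    import b    # reads at the stale offset: bam, error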

There appear to be several ways to fix this; I'm not sure which is best:

* Possibly lock the ZIP file on first import so it can't change (this presents 
many new issues).
* Rescan the ZIP before each import, though the whole point of caching the 
contents appears to be avoiding exactly that.
* Hash the entire file and compare (expensive CPU-wise).
* Rely on modification time: e.g. cache the whole zip's mtime at first read and, 
if it later differs, invalidate the cache and rescan (a rough sketch follows this 
list).
* Cache the entire zip file into memory at first load. This has some advantages 
(the ZIP data can stay compressed; the import becomes all-or-nothing; possibly 
faster?), but there would need to be some kind of variable to limit the 
size/total size, or it becomes a memory hog...
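
To make the mtime idea concrete, here is a rough sketch of just the invalidation 
logic. The helper name is made up, .archive/_files/_read_directory are private 
details of the pure-Python implementation, and a real fix would have to hook this 
into zipimport's internal load path rather than live outside it:

    import os
    import zipimport

    def rescan_if_changed(importer, cached_mtime):
        """Rebuild the importer's cached directory if the archive changed.

        Returns the mtime to cache for the next check. Hypothetical
        helper -- shown only to illustrate the invalidation strategy.
        """
        mtime = os.stat(importer.archive).st_mtime
        if mtime != cached_mtime:
            # Archive changed on disk: rescan the central directory so
            # the cached offsets/CRCs match what's actually in the file.
            importer._files = zipimport._read_directory(importer.archive)
        return mtime

Comparing mtimes is of course racy on filesystems with coarse timestamp 
granularity, but it's cheap compared to hashing the whole archive.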

----------
nosy: +danieljewell

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue19081>
_______________________________________