[issue40172] ZipInfo corrupts file names in some old zip archives
Daniel Hillier added the comment: Related to issue https://bugs.python.org/issue28080 which has a patch that covers a bit of this issue -- ___ Python tracker <https://bugs.python.org/issue40172> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue45981] Get raw file name in bytes from ZipFile
Daniel Hillier added the comment: Handling different character sets is not completely supported yet. There are a couple of open issues relating to this: https://bugs.python.org/issue40407 (reading file names), https://bugs.python.org/issue41928 (support for reading and writing filenames using the unicode filename extra field) and https://bugs.python.org/issue40172 (issues with reading and then writing a filename from and back into a zip where the initial filename isn't encoded in cp437). Most modern zip programs that deal with characters outside ascii or cp437 either set the utf-8 flag or write both an ascii or cp437 compatible filename (to the original filename field in the zip header) and the actual filename with all non-ascii characters in the unicode filename extra field. I think adding support for the unicode field to Python would probably cover the majority files generated by modern zip programs. For complete support, including older zip programs that don't support the utf-8 flag or unicode filename extra field, we may need to provide another parameter in Python's ZipFile's read and write functions to be able to override the charset used for the filename stored directly in the zip file header. I've added my thoughts on how to approach this in https://bugs.python.org/issue40172 but haven't had time to implement these myself. -- nosy: +dhillier ___ Python tracker <https://bugs.python.org/issue45981> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue39359] zipfile: add missing "pwd: expected bytes, got str" exception message
Daniel Hillier added the comment: I agree it is bad form but I would accidentally do it when I couldn't remember the proper API and took a stab in the dark without looking up the docs. I unfortunately used it in an example in the docs for pyzipper and started getting a few bug reports even after fixing my docs. -- ___ Python tracker <https://bugs.python.org/issue39359> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue39359] zipfile: add missing "pwd: expected bytes, got str" exception message
Daniel Hillier added the comment: Thanks Łukasz! -- ___ Python tracker <https://bugs.python.org/issue39359> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue40172] ZipInfo corrupts file names in some old zip archives
Daniel Hillier added the comment: Looking into this more and it appears that while Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT says "If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding" where the original encoding is IBM 437 (cp437), this is not always followed. This isn't too surprising as cp437 doesn't have every character for every language! In particular, some archive programs on windows will use the user's locale code page. https://superuser.com/questions/1321371/proper-encoding-for-file-names-in-zip-archives-created-in-windows-and-unpacked-i A UTF filename can be stored in the extra field 0x7075 in addition to a filename encoded in an arbitrary code page stored in the header's filename section. There is a open issue to add handling these fields (for reading) to zipfile: https://bugs.python.org/issue41928 and that issue may be related to this one https://bugs.python.org/issue40407 For this issue, with regards to encoding, I prefer the current situation where general purpose bit 11 for UTF is preferentially used because it doesn't change the behaviour compared to previous Python versions and it reduces file size as the filename isn't repeated in the extra field. For compatibility with other archive programs that don't support the general purpose bit 11, I suggest we add an additional mechanism to allow the code page for the path name (and comment) to be set and use the 0x7075 extra field to store the UTF name in those cases where the filename can't be encoded in ascii (and 0x6075 to store the utf comment where it can't be encoded in ascii) -- ___ Python tracker <https://bugs.python.org/issue40172> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue40172] ZipInfo corrupts file names in some old zip archives
Daniel Hillier added the comment: zipfile decodes filenames using cp437 or unicode and encodes using ascii or unicode. It seems like zipfile has a preference for writing filenames in unicode rather than cp437. Is zipfile's preference for writing filenames in unicode rather than cp437 intentional? Is the bug you're seeing related to using zipfile to open and rewrite old zips and not being able to open the rewritten files in an old program that doesn't support the unicode flag? We could address this two ways: - Change ZipInfo._encodeFilenameFlags() to always encode to cp437 if possible - Add a flag to write filenames in cp437 or unicode, otherwise the current situation of ascii or unicode I guess the choice will depend on if preferring unicode rather than cp437 is intentional and if writing filenames in cp437 will break anything (it shouldn't break anything according to Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT) Here's a test for your current patch (I'd probably put it alongside OtherTests.test_read_after_write_unicode_filenames as this test was adapted from that one) class OtherTests(unittest.TestCase): ... def test_read_after_write_cp437_filenames(self): fname = 'test_cp437_é' with zipfile.ZipFile(TESTFN2, 'w') as zipfp: zipfp.writestr(fname, b'sample') with zipfile.ZipFile(TESTFN2) as zipfp: zinfo = zipfp.infolist()[0] # Ensure general purpose bit 11 (Language encoding flag # (EFS)) is unset to indicate the filename is not unicode self.assertFalse(zinfo.flag_bits & 0x800) self.assertEqual(zipfp.read(fname), b'sample') -- nosy: +dhillier ___ Python tracker <https://bugs.python.org/issue40172> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44129] zipfile: Add descriptive global variables for general purpose bit flags
Change by Daniel Hillier : -- keywords: +patch pull_requests: +24763 stage: -> patch review pull_request: https://github.com/python/cpython/pull/26118 ___ Python tracker <https://bugs.python.org/issue44129> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44129] zipfile: Add descriptive global variables for general purpose bit flags
New submission from Daniel Hillier : In the zipfile module, masking of bit flags is done against hex numbers eg. if flags & 0x800... To increase readability I suggest we replace these with global variables named for the purpose of the flag. From the example above: if flags & 0x800 becomes: if flags & _MASK_UTF_FILENAME -- components: Library (Lib) messages: 393622 nosy: dhillier, serhiy.storchaka priority: normal severity: normal status: open title: zipfile: Add descriptive global variables for general purpose bit flags versions: Python 3.10, Python 3.11 ___ Python tracker <https://bugs.python.org/issue44129> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44128] zipfile: Deduplicate ZipExtFile code for init and resetting when seeking
Change by Daniel Hillier : -- keywords: +patch pull_requests: +24761 stage: -> patch review pull_request: https://github.com/python/cpython/pull/26116 ___ Python tracker <https://bugs.python.org/issue44128> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue44128] zipfile: Deduplicate ZipExtFile code for init and resetting when seeking
New submission from Daniel Hillier : Integrating a refactor suggested in https://bugs.python.org/issue38334 The logic for preparing a ZipExtFile for reading (setting CRC state, read positions, etc) is currently in two locations: first initialisation and when seeking back to the start of a file. This change moves the logic into the method `ZipExtFile.init_read()` -- components: Library (Lib) messages: 393619 nosy: dhillier, serhiy.storchaka priority: normal severity: normal status: open title: zipfile: Deduplicate ZipExtFile code for init and resetting when seeking type: enhancement versions: Python 3.10, Python 3.11 ___ Python tracker <https://bugs.python.org/issue44128> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue40301] zipfile module: new feature (two lines of code), useful for test, security and forensics
Daniel Hillier added the comment: Hi Massimo, Unless I'm missing something about your requirements, the advantage is that it already works in python 2.7 so there is no need to patch Python. Just bundle the above function with your analysis tool and you're good to go. Cheers, Dan On Sat, Apr 18, 2020 at 11:36 PM Massimo Sala wrote: > > Massimo Sala added the comment: > > Hi Daniel > > Could you please elaborate the advantages of your loop versus my two lines > of code? > I don't grasp... > > Thanks, Massimo > > On Sat, 18 Apr 2020 at 03:26, Daniel Hillier > wrote: > > > > > Daniel Hillier added the comment: > > > > Could something similar be achieved by looking for the earliest file > > header offset? > > > > def find_earliest_header_offset(zf): > > earliest_offset = None > > for zinfo in zf.infolist(): > > if earliest_offset is None: > > earliest_offset = zinfo.header_offset > > else: > > earliest_offset = min(zinfo.header_offset, earliest_offset) > > return earliest_offset > > > > > > You could also adapt this using > > > > zinfo.compress_size + len(zinfo.FileHeader()) > > > > to see if there were any sections inside the archive which were not > > referenced from the central directory. Not sure if zip files with > arbitrary > > bytes inside the archive would be valid everywhere, but I think they are > > using zipfile. > > > > You can also have zipped content inside an archive which has a valid > > fileheader but no reference from the central directory. Those entries are > > discoverable by implementations which process content serially from the > > start of the file but not implementations which rely on the central > > directory. > > > > -- > > nosy: +dhillier > > > > ___ > > Python tracker > > <https://bugs.python.org/issue40301> > > ___ > > > > -- > > ___ > Python tracker > <https://bugs.python.org/issue40301> > ___ > -- ___ Python tracker <https://bugs.python.org/issue40301> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue40301] zipfile module: new feature (two lines of code), useful for test, security and forensics
Daniel Hillier added the comment: Could something similar be achieved by looking for the earliest file header offset? def find_earliest_header_offset(zf): earliest_offset = None for zinfo in zf.infolist(): if earliest_offset is None: earliest_offset = zinfo.header_offset else: earliest_offset = min(zinfo.header_offset, earliest_offset) return earliest_offset You could also adapt this using zinfo.compress_size + len(zinfo.FileHeader()) to see if there were any sections inside the archive which were not referenced from the central directory. Not sure if zip files with arbitrary bytes inside the archive would be valid everywhere, but I think they are using zipfile. You can also have zipped content inside an archive which has a valid fileheader but no reference from the central directory. Those entries are discoverable by implementations which process content serially from the start of the file but not implementations which rely on the central directory. -- nosy: +dhillier ___ Python tracker <https://bugs.python.org/issue40301> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue39294] zipfile.ZipInfo objects contain invalid 'extra' fields.
Daniel Hillier added the comment: This looks to be expected behaviour for the zip64 extension in the zip spec (for handling large files or large archives). Section 4.4.1.4 of the zip spec outlines when the zip64 extra fields are used (https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT). In short, when the file offset header (number of bytes to the start of the file in the archive) exceeds the size allotted in the header in the original spec (0x or just under 2Gb). Let me know if what you're observing is unrelated to this. -- nosy: +dhillier ___ Python tracker <https://bugs.python.org/issue39294> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue39359] zipfile: add missing "pwd: expected bytes, got str" exception message
Change by Daniel Hillier : -- keywords: +patch pull_requests: +17427 stage: -> patch review pull_request: https://github.com/python/cpython/pull/18031 ___ Python tracker <https://bugs.python.org/issue39359> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue39359] zipfile: add missing "pwd: expected bytes, got str" exception message
New submission from Daniel Hillier : Setting the ZipFile.pwd attribute directly skips the check to ensure the password is a bytes object and, if not, return a user friendly TypeError("pwd: expected bytes, got ") informing them of that. -- components: Library (Lib) messages: 360118 nosy: dhillier priority: normal severity: normal status: open title: zipfile: add missing "pwd: expected bytes, got str" exception message type: enhancement versions: Python 3.7, Python 3.8, Python 3.9 ___ Python tracker <https://bugs.python.org/issue39359> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37523] zipfile: Raise ValueError for i/o operations on closed zipfile.ZipExtFile
Daniel Hillier added the comment: Good point. Thanks for the advice. I've updated it to use timeit. Does that give a better indication? import zipfile test_zip = "time_test.zip" test_name = "test_name.txt" # with zipfile.ZipFile(test_zip, "w") as zf: # zf.writestr(test_name, "Hi there! " * 300) def test(): with zipfile.ZipFile(test_zip) as zf: for i in range(10): zf.read(test_name) if __name__ == "__main__": import timeit print(timeit.repeat("test()", setup="from __main__ import test", number=1, repeat=10)) On my machine (running a usual workload) the best of 10 was: master: 3.812s check closed: 3.874s But I think my machine had a quite a bit of variance between runs. -- ___ Python tracker <https://bugs.python.org/issue37523> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37523] zipfile: Raise ValueError for i/o operations on closed zipfile.ZipExtFile
Daniel Hillier added the comment: Here's the script I used for profiling and the results I observed with and without the closed check in read: import zipfile test_zip = "time_test.zip" test_name = "test_name.txt" with zipfile.ZipFile(test_zip, "w") as zf: zf.writestr(test_name, "Hi there! " * 300) with zipfile.ZipFile(test_zip) as zf: for i in range(10): zf.read(test_name) # Current code (no closed check), three different profiling sessions: # ncalls tottime percall cumtime percall filename:lineno(function) # 100.6120.0006.6380.000 zipfile.py:884(read) # 100.5980.0006.4890.000 zipfile.py:884(read) # 100.6000.0006.4850.000 zipfile.py:884(read) # With closed check, three different profiling sessions: # ncalls tottime percall cumtime percall filename:lineno(function) # 100.6320.0006.5870.000 zipfile.py:884(read) # 100.6230.0006.5640.000 zipfile.py:884(read) # 100.6380.0006.7000.000 zipfile.py:884(read) --- I based this change on the what BytesIO does: https://github.com/python/cpython/blob/master/Lib/_pyio.py#L912 Let me know if you want me to make any changes. -- ___ Python tracker <https://bugs.python.org/issue37523> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38334] zipfile: Seeking encrypted file breaks after seeking backwards
Daniel Hillier added the comment: Thanks for your help! Good point, I'll create a new change for the refactoring. -- ___ Python tracker <https://bugs.python.org/issue38334> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38334] zipfile: Seeking encrypted file breaks after seeking backwards
Daniel Hillier added the comment: I also think that the `read_init` method in my PR is a useful refactor as it locates all the state that needs to be (re)set when starting a read into the same location. At the moment this state is set in 1) __init__ and 2) the seek method when seeking back beyond what is in the buffer. -- ___ Python tracker <https://bugs.python.org/issue38334> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38334] zipfile: Seeking encrypted file breaks after seeking backwards
Daniel Hillier added the comment: Thanks for looking at the PR. I got carried away refactoring the decrypter for a future scenario where there could be different decrypters (possibly using certificates too) :) Your PR is much simpler. Would you also be able to take a look at some other PRs I've submitted for zipfile. They are both pretty small changes: https://bugs.python.org/issue36993 https://bugs.python.org/issue37523 Thanks again, Dan -- ___ Python tracker <https://bugs.python.org/issue38334> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38334] zipfile: Seeking encrypted file breaks after seeking backwards
Daniel Hillier added the comment: Hi, I have another patch I would like to contribute to the zipfile module but would like to request a review of this one to minimise conflicts with later patches. If anyone is able to review the patch, I would really appreciate it :) Also, with regards to selecting Python versions for this issue, is there any rule of thumb I should be following? Thanks, Dan -- ___ Python tracker <https://bugs.python.org/issue38334> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38334] zipfile: Seeking encrypted file breaks after seeking backwards
Change by Daniel Hillier : -- nosy: +serhiy.storchaka ___ Python tracker <https://bugs.python.org/issue38334> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38334] zipfile: Seeking encrypted file breaks after seeking backwards
Change by Daniel Hillier : -- keywords: +patch pull_requests: +16120 stage: -> patch review pull_request: https://github.com/python/cpython/pull/16529 ___ Python tracker <https://bugs.python.org/issue38334> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue38334] zipfile: Seeking encrypted file breaks after seeking backwards
New submission from Daniel Hillier : Seeking back beyond the decrypted / unzipped buffer doesn't reset the decrypter's crc key values. All data read after seeking back beyond the buffer is garbled. -- components: Library (Lib) messages: 353646 nosy: dhillier priority: normal severity: normal status: open title: zipfile: Seeking encrypted file breaks after seeking backwards type: enhancement versions: Python 3.9 ___ Python tracker <https://bugs.python.org/issue38334> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37538] Refactor zipfile to ease subclassing and enhancement
Change by Daniel Hillier : -- keywords: +patch pull_requests: +14725 stage: -> patch review pull_request: https://github.com/python/cpython/pull/14957 ___ Python tracker <https://bugs.python.org/issue37538> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37538] Refactor zipfile to ease subclassing and enhancement
Daniel Hillier added the comment: Hi, Here is a pull request against my fork: https://github.com/danifus/cpython/pull/1/files The overall behaviour of zipfile remains the same and I've tried to call out any behaviour changes in the extended commit messages (usually with ** markers). There is another branch (https://github.com/danifus/cpython/tree/zipfile_refactor_2) which has a couple of extra commits which slightly changes some functionality. Any review would be greatly appreciated! Happy to make any changes to the code or general approach taken. Thanks, Dan -- ___ Python tracker <https://bugs.python.org/issue37538> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37538] Refactor zipfile to ease subclassing and enhancement
Daniel Hillier added the comment: I've started a branch on my github fork if anyone wants to follow along. https://github.com/danifus/cpython/tree/zipfile_refactor Is there a better way to manage this in terms of review and suggestions as I add more commits? -- ___ Python tracker <https://bugs.python.org/issue37538> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37538] Refactor zipfile to ease subclassing and enhancement
New submission from Daniel Hillier : I've written https://github.com/danifus/pyzipper which incorporates a refactor of zipfile.py in order to support winzip AES encryption. I don't intend to include the crypto code but I would like to incorporate the refactor to help others subclass and extend the objects found in zipfile.py For a longer description of the general approach I've taken to the refactor, see the python-ideas thread: https://mail.python.org/archives/list/python-id...@python.org/thread/QCKHI5JYMKOPMIF6Q2NI7JIFHV6KO746/#QCKHI5JYMKOPMIF6Q2NI7JIFHV6KO746 -- components: Library (Lib) messages: 347604 nosy: dhillier priority: normal severity: normal status: open title: Refactor zipfile to ease subclassing and enhancement type: enhancement versions: Python 3.9 ___ Python tracker <https://bugs.python.org/issue37538> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37523] zipfile: Raise ValueError for i/o operations on closed zipfile.ZipExtFile
Change by Daniel Hillier : -- keywords: +patch pull_requests: +14465 stage: -> patch review pull_request: https://github.com/python/cpython/pull/14658 ___ Python tracker <https://bugs.python.org/issue37523> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue37523] zipfile: Raise ValueError for i/o operations on closed zipfile.ZipExtFile
New submission from Daniel Hillier : After closing a file object opened from a ZipFile, attempting i/o operations raises AttributeError because the underlying fd has been set to None. We should be raising ValueErrors consistent with io.FileIO behaviour. Similar inconsistencies exist for the following operations on a closed ZipExtFile: - seek - seekable - read - readable - tell -- components: Library (Lib) messages: 347524 nosy: dhillier priority: normal severity: normal status: open title: zipfile: Raise ValueError for i/o operations on closed zipfile.ZipExtFile type: behavior versions: Python 3.7, Python 3.8 ___ Python tracker <https://bugs.python.org/issue37523> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue36993] zipfile: tuple IndexError on extract
Daniel Hillier added the comment: I've pushed a PR which adds a test that generates corrupt zip64 files with different combinations of zip64 extra data lengths and zip64 flags (which determines how many fields are required in the extra data). It now raises a BadZipFile with a message naming the first missing field. -- nosy: +dhillier ___ Python tracker <https://bugs.python.org/issue36993> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue36993] zipfile: tuple IndexError on extract
Change by Daniel Hillier : -- pull_requests: +14463 pull_request: https://github.com/python/cpython/pull/14656 ___ Python tracker <https://bugs.python.org/issue36993> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com