[issue40172] ZipInfo corrupts file names in some old zip archives

2022-03-22 Thread Gregory P. Smith


Gregory P. Smith  added the comment:

Examining Lib/zipfile.py code, the existing code makes sense. Python's zipfile 
module produces modern zipfiles when writing by setting the utf-8 flag and 
storing the filename as utf-8 when it is not ASCII.  This is desirable for use 
with all normal zip implementations in the past 10-15 years.

When decoding a zipfile, if the utf-8 flag is not set, we assume cp437 per the 
PKWARE APPNOTE.TXT "spec".  So our reading is correct as well, even for 
very old files.
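
As a rough illustration of the reading rule described above (a simplified 
sketch, not the actual Lib/zipfile.py code; the function and constant names 
are made up for this example):

```python
UTF8_FLAG = 0x800  # general purpose bit 11 (language encoding flag)


def decode_member_name(raw: bytes, flag_bits: int) -> str:
    """Decode a stored filename the way the comment above describes."""
    if flag_bits & UTF8_FLAG:
        # Flag set: the writer declared the name to be utf-8.
        return raw.decode('utf-8')
    # Flag unset: fall back to the original ZIP encoding, cp437.
    return raw.decode('cp437')
```

Every byte value maps to some character in Python's cp437 codec, so the 
fallback branch never raises; it can only return the wrong characters if the 
writer actually used a different code page.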

This is being strict in what we produce and lenient in what we accept.  
Caveats?  Yes:

If someone does need to produce zipfiles for use with ancient software that 
does not support utf-8, and that also does not treat the unknown utf-8 flag as 
an error condition, that software will interpret non-ascii names in a corrupt 
manner.

Similarly, even if written with cp437 names (as PR 19335 would do), an old zip 
implementation that blindly uses the user's locale encoding instead of cp437 
will always see corrupt data in that scenario (aka mojibake?).
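
A small sketch of that second scenario. The choice of cp1252 as the "user's 
locale encoding" is only an assumption for the example; any code page other 
than cp437 shows the same effect:

```python
name = 'café.txt'

# Bytes as a cp437-writing zip tool would store them in the header.
stored = name.encode('cp437')

# An old tool that ignores the spec and decodes with the locale code page
# (cp1252 is assumed here) gets a different character for the same byte.
seen_by_old_tool = stored.decode('cp1252', errors='replace')

print(repr(name), '->', repr(seen_by_old_tool))
```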

These are not what I'd expect to be normal use cases. Do you have a common 
practical example of a need for this?

(The PR on issue28080 provides a way to _read_ legacy zip files that used a 
codec other than cp437 if you know what it was.)
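
Independently of that PR, a commonly used workaround on the reading side is to 
undo the cp437 decoding and decode again with the codec the archive really 
used. This is a sketch only; `redecode_legacy_name` is a hypothetical helper, 
and the caller has to know the real codec:

```python
UTF8_FLAG = 0x800  # general purpose bit 11


def redecode_legacy_name(name: str, flag_bits: int, codec: str) -> str:
    """Re-decode a name that zipfile decoded as cp437 by default."""
    if flag_bits & UTF8_FLAG:
        # The name was stored as utf-8 and is already correct.
        return name
    # cp437 maps all 256 byte values, so encoding back recovers the
    # original raw bytes exactly.
    return name.encode('cp437').decode(codec)
```

For each entry in `ZipFile.infolist()` one would call 
`redecode_legacy_name(info.filename, info.flag_bits, 'cp866')` (or whatever 
codec applies) before choosing the output path.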

---

https://www.loc.gov/preservation/digital/formats/fdd/fdd000354.shtml may also 
be of interest regarding the zip format.

--
nosy: +gregory.p.smith

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue40172] ZipInfo corrupts file names in some old zip archives

2022-03-21 Thread Daniel Hillier


Daniel Hillier  added the comment:

Related to issue https://bugs.python.org/issue28080 which has a patch that 
covers a bit of this issue




[issue40172] ZipInfo corrupts file names in some old zip archives

2022-03-05 Thread Yudi Levi


Yudi Levi  added the comment:

The main issue is that when extracting older zip files, files are actually 
written to disk with corrupted (altered) names.
Unfortunately it's been a while since I saw this issue and I can't tell if it 
was fixed or if I simply can't reproduce it.
I do see that encoding/decoding in ZipInfo is still inconsistent: it sometimes 
uses the ascii codepage and sometimes the cp437 codepage, which seems wrong to 
me.
I'm not sure how we should handle it, but I think that switching the default 
ascii encoding to cp437, to be consistent with the old implementation (and with 
the filename decoding), seems like the right way to go.




[issue40172] ZipInfo corrupts file names in some old zip archives

2021-05-26 Thread Daniel Hillier


Daniel Hillier  added the comment:

Looking into this more, it appears that while Appendix D of 
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT says "If general 
purpose bit 11 is unset, the file name and comment SHOULD conform to the 
original ZIP character encoding", where the original encoding is IBM 437 
(cp437), this is not always followed. This isn't too surprising, as cp437 
doesn't have every character for every language! In particular, some archive 
programs on Windows will use the user's locale code page.

https://superuser.com/questions/1321371/proper-encoding-for-file-names-in-zip-archives-created-in-windows-and-unpacked-i

A UTF-8 filename can be stored in the extra field 0x7075 in addition to a 
filename encoded in an arbitrary code page stored in the header's filename 
section. There is an open issue to add handling of these fields (for reading) 
to zipfile: https://bugs.python.org/issue41928 and that issue may be related to 
this one: https://bugs.python.org/issue40407
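
To make the extra field concrete, here is a minimal reading sketch. It assumes 
the layout APPNOTE.TXT gives for the Info-ZIP Unicode Path field (header ID, 
data size, a version byte, a CRC-32 of the header filename bytes, then the 
UTF-8 name). `unicode_path_from_extra` is a hypothetical helper, not part of 
zipfile:

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path extra field


def unicode_path_from_extra(extra: bytes, raw_name: bytes):
    """Return the UTF-8 name stored in `extra`, or None if absent/stale."""
    pos = 0
    while pos + 4 <= len(extra):
        # Each extra block starts with a little-endian ID and data size.
        header_id, size = struct.unpack_from('<HH', extra, pos)
        data = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(data) < 5:
            continue
        version, name_crc = struct.unpack_from('<BL', data, 0)
        # The CRC ties the field to the header filename; a mismatch means
        # the header name was changed and the field should be ignored.
        if version == 1 and name_crc == zlib.crc32(raw_name):
            return data[5:].decode('utf-8')
    return None
```

`raw_name` is the undecoded filename from the header, which zipfile does not 
expose directly; `ZipInfo.extra` provides the `extra` bytes.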

For this issue, with regards to encoding, I prefer the current situation where 
general purpose bit 11 for UTF is preferentially used because it doesn't change 
the behaviour compared to previous Python versions and it reduces file size as 
the filename isn't repeated in the extra field.

For compatibility with other archive programs that don't support general 
purpose bit 11, I suggest we add an additional mechanism to allow the code page 
for the path name (and comment) to be set, and use the 0x7075 extra field to 
store the UTF-8 name in those cases where the filename can't be encoded in 
ascii (and 0x6375 to store the UTF-8 comment where it can't be encoded in 
ascii).
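
A sketch of what the writing side of this suggestion might produce for the 
path name. Nothing like this exists in zipfile today; 
`build_unicode_path_extra` and its `codepage` parameter are hypothetical:

```python
import struct
import zlib


def build_unicode_path_extra(name: str, codepage: str):
    """Return (header filename bytes, 0x7075 extra field bytes)."""
    # The name as it would appear in the header, in the chosen code page.
    raw_name = name.encode(codepage)
    # Version 1, CRC-32 of the header filename bytes, then the UTF-8 name.
    body = struct.pack('<BL', 1, zlib.crc32(raw_name)) + name.encode('utf-8')
    return raw_name, struct.pack('<HH', 0x7075, len(body)) + body
```

The cost mentioned above is visible here: the name is stored twice, once in 
the header and once in the extra field.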




[issue40172] ZipInfo corrupts file names in some old zip archives

2021-05-24 Thread Daniel Hillier

Daniel Hillier  added the comment:

zipfile decodes filenames using cp437 or unicode and encodes using ascii or 
unicode, so it prefers writing non-ascii filenames in unicode rather than 
cp437. Is that preference intentional?
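
The writing behaviour can be observed with a short script (a sketch using an 
in-memory archive; the member names are arbitrary):

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('plain.txt', b'a')    # pure ascii name
    zf.writestr('test_é.txt', b'b')   # representable in cp437, not in ascii

with zipfile.ZipFile(buf) as zf:
    # True where general purpose bit 11 (utf-8 name) is set.
    flags = {info.filename: bool(info.flag_bits & 0x800)
             for info in zf.infolist()}

print(flags)
```

Without the proposed patch, the second name is written as utf-8 with bit 11 
set, even though cp437 could represent it.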

Is the bug you're seeing related to using zipfile to open and rewrite old zips 
and not being able to open the rewritten files in an old program that doesn't 
support the unicode flag?

We could address this in two ways:
- Change ZipInfo._encodeFilenameFlags() to always encode to cp437 if possible
- Add a flag to write filenames in cp437 or unicode; without it, keep the 
current behaviour of ascii or unicode
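
The first option could look roughly like the following standalone sketch. It 
mirrors the shape of ZipInfo._encodeFilenameFlags() but is not the code from 
the PR; the function name is made up for the example:

```python
UTF8_FLAG = 0x800  # general purpose bit 11


def encode_filename_flags(filename: str, flag_bits: int = 0):
    """Prefer cp437; fall back to utf-8 with bit 11 set."""
    try:
        # cp437 is a superset of ascii, so plain ascii names also take
        # this branch and are stored byte-for-byte as before.
        return filename.encode('cp437'), flag_bits
    except UnicodeEncodeError:
        return filename.encode('utf-8'), flag_bits | UTF8_FLAG
```

The second option would wrap the same choice behind an opt-in switch, leaving 
the default (ascii, else utf-8) untouched.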

I guess the choice will depend on if preferring unicode rather than cp437 is 
intentional and if writing filenames in cp437 will break anything (it shouldn't 
break anything according to Appendix D of 
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT)

Here's a test for your current patch (I'd probably put it alongside 
OtherTests.test_read_after_write_unicode_filenames as this test was adapted 
from that one)

class OtherTests(unittest.TestCase):
    ...

    def test_read_after_write_cp437_filenames(self):
        fname = 'test_cp437_é'
        with zipfile.ZipFile(TESTFN2, 'w') as zipfp:
            zipfp.writestr(fname, b'sample')

        with zipfile.ZipFile(TESTFN2) as zipfp:
            zinfo = zipfp.infolist()[0]
            # Ensure general purpose bit 11 (Language encoding flag
            # (EFS)) is unset to indicate the filename is not unicode
            self.assertFalse(zinfo.flag_bits & 0x800)
            self.assertEqual(zipfp.read(fname), b'sample')

--
nosy: +dhillier




[issue40172] ZipInfo corrupts file names in some old zip archives

2021-05-16 Thread Yudilevi


Yudilevi  added the comment:

Hey :)

Sorry that I'm not responsive, just busy.
I'll add one soon.

Yudi

On Mon, May 17, 2021 at 12:08 AM Irit Katriel 
wrote:

>
> Irit Katriel  added the comment:
>
> Can you suggest a unit test for this?
>
> --
> nosy: +iritkatriel




[issue40172] ZipInfo corrupts file names in some old zip archives

2021-05-16 Thread Irit Katriel


Irit Katriel  added the comment:

Can you suggest a unit test for this?

--
nosy: +iritkatriel




[issue40172] ZipInfo corrupts file names in some old zip archives

2020-04-03 Thread Yudi


Change by Yudi :


--
keywords: +patch
pull_requests: +18697
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/19335




[issue40172] ZipInfo corrupts file names in some old zip archives

2020-04-03 Thread Yudi


New submission from Yudi :

Some old zip files that don't yet use unicode file names might have entries 
with characters beyond the ascii range.
ZipInfo seems to encode these file names with 'cp437' codepage (correct for old 
zips) but decode them back with 'ascii' code page which might corrupt them.

--
components: Library (Lib)
files: example.zip
messages: 365701
nosy: yudilevi
priority: normal
severity: normal
status: open
title: ZipInfo corrupts file names in some old zip archives
type: behavior
versions: Python 3.8
Added file: https://bugs.python.org/file49030/example.zip
