[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

John Goerzen Mon, 25 Nov 2019 20:28:57 -0800


John Goerzen <jgoer...@users.sourceforge.net> added the comment:


Hi Jon,

I've read your article in the gist, the ZIP spec, and the article you linked 
to.  As the article you linked to 
(https://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/) states, 
"Implementers just encode file names however they want (usually byte for byte 
as they are in the OS".  That is certainly my observation.  CP437 has NEVER 
been guaranteed, *even on DOS*.  See 
https://en.wikipedia.org/wiki/Category:DOS_code_pages and 
https://www.aivosto.com/articles/charsets-codepages-dos.html for details on DOS 
code pages.  I do not recall any translation between DOS code pages being done 
in practice, or even being possible, since the whole point of multiple code 
pages was the need for more than 256 symbols.  So (leaving aside UTF-8 
encodings for a second) no operating system or ZIP implementation I am aware of 
performs a translation to cp437, and such translation is often not even 
possible.  They just copy literal bytes into the ZIP, as the POSIX filesystem 
itself does.
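
To illustrate the code page point, here is a minimal Python sketch.  The byte 
0x9B is picked arbitrarily for illustration; the codec names are the ones 
shipped with Python's standard library.

```python
# The same byte reads as a different character under each DOS code page.
raw = b"\x9b"
decoded = {cp: raw.decode(cp) for cp in ("cp437", "cp850", "cp866")}
for cp, ch in decoded.items():
    print(cp, repr(ch))

# Translating the cp866 reading into cp437 fails, because cp437 has no
# slot for that character.
try:
    decoded["cp866"].encode("cp437")
    translatable = True
except UnicodeEncodeError:
    translatable = False
print("cp866 -> cp437 possible:", translatable)
```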

So, from the above paragraph, it's clear that the assumption in zipfile that 
cp437 is in use is faulty.  Your claim that Python "fixes" a problem is also 
faulty.  Taking a latin-1 byte, decoding it with the cp437 codec, and 
generating a filename with the resulting cp437 character represented as a 
Unicode code point is wrong in many ways.  Python should not take an opinion on 
this; it 
should be agnostic and copy the bytes that represent the filename in the ZIP to 
bytes that represent the filename on the filesystem.
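
A small sketch of the corruption described above, assuming a name that was 
stored as latin-1 bytes and is written back out as UTF-8 (the name "café" is 
only an example):

```python
# "café" as it would be stored by a latin-1 system: the last byte is 0xE9.
stored = b"caf\xe9"

# Decoding those bytes as cp437 picks a different character for 0xE9.
as_cp437 = stored.decode("cp437")

# Writing the result on a UTF-8 filesystem produces different bytes again.
on_disk = as_cp437.encode("utf-8")

print(repr(as_cp437), on_disk)
print("bytes preserved:", on_disk == stored)
```

The name that lands on disk matches neither the original bytes nor the 
character the creator of the archive intended.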

POSIX filenames may contain any of 254 byte values (only 0x00 and '/' are 
invalid).  The filesystem is encoding-agnostic; POSIX filenames are just 
streams of bytes.  
There is no alternative but to treat ZIP filenames (without the Unicode flag) 
the same way.  Copy bytes to bytes.  It is not possible to identify the 
encoding of the filename in the absence of the Unicode flag.
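
For reference, Python already has a way to carry arbitrary filename bytes 
through a str without loss: the surrogateescape error handler, which 
os.fsdecode() and os.fsencode() use on POSIX.  A minimal sketch, with UTF-8 
named explicitly so that it behaves the same on any platform:

```python
# A filename that is not valid UTF-8.
raw = b"caf\xe9"

# surrogateescape maps each undecodable byte to a lone surrogate...
as_str = raw.decode("utf-8", errors="surrogateescape")

# ...and maps it back to the original byte on encoding.
round_trip = as_str.encode("utf-8", errors="surrogateescape")

print(repr(as_str), round_trip == raw)
```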

zipfile should:

1) expose a bytes interface to filename
2) use byte-for-byte extraction when no Unicode flag is present
3) not make the assumption that cp437 was the original encoding
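
Until something like that exists, the stored bytes can usually be recovered by 
reversing the cp437 decoding.  The sketch below is only an illustration: 
raw_filename() is a hypothetical helper, not part of zipfile, and it assumes 
the archive was opened with zipfile's default decoding.  The test archive is 
built by patching an ASCII placeholder name, since zipfile itself will not 
write a non-ASCII name without setting the Unicode flag.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the name is stored as UTF-8


def raw_filename(info: zipfile.ZipInfo) -> bytes:
    """Hypothetical helper: recover the bytes stored for a member name."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename.encode("utf-8")
    # Assumption: zipfile decoded the name with its default cp437 codec,
    # which maps all 256 byte values, so encoding reverses it exactly.
    return info.filename.encode("cp437")


# Build an archive, then patch the placeholder name (same length) so the
# stored name is latin-1 "café" with no Unicode flag set.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("cafX", b"data")
patched = buf.getvalue().replace(b"cafX", b"caf\xe9")

with zipfile.ZipFile(io.BytesIO(patched)) as zf:
    info = zf.infolist()[0]

print(repr(info.filename), raw_filename(info))
```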

Your proposal only "works" cross-platform because it is broken on every 
platform!

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38861>
_______________________________________