Devourer Station <accelerator0...@gmail.com> added the comment:

In file Lib/zipfile.py:

    flags = centdir[5]
    if flags & 0x800:
        # UTF-8 file names extension
        filename = filename.decode('utf-8')
    else:
        # Historical ZIP filename encoding
        filename = filename.decode('cp437')
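The decoding branch quoted above can be sketched in isolation as follows. This is only an illustration, not the zipfile implementation: `decode_member_name` and `UTF8_FLAG` are hypothetical names introduced here, and the sample bytes are the ones used in this report. It also shows that, as far as Python's cp437 codec is concerned, the CP437 decoding step on its own is reversible.

```python
# Hypothetical constant name; 0x800 is the flag value tested in the quoted code.
UTF8_FLAG = 0x800


def decode_member_name(raw: bytes, flags: int) -> str:
    """Hypothetical helper mirroring the quoted branch, not a zipfile API."""
    if flags & UTF8_FLAG:
        # UTF-8 file names extension
        return raw.decode("utf-8")
    # Historical ZIP filename encoding
    return raw.decode("cp437")


# Raw file name bytes taken from this report (a SHIFT_JIS encoded name).
raw = b"\x97\x5c\x92\x9b"

# Without the UTF-8 flag, the bytes are interpreted as CP437 text.
as_cp437 = decode_member_name(raw, flags=0)

# Encoding the decoded text as CP437 again gives back the original bytes,
# as long as nothing has modified the string in between.
recovered = as_cp437.encode("cp437")

print(ascii(as_cp437))
print(recovered == raw)  # True
```

The round trip only holds while the decoded string is left untouched; any later rewriting of characters in it changes the bytes that come back.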
ZipFile simply decodes every non-UTF-8 file name as CP437.

In file Lib/zipfile.py:

    # This is used to ensure paths in generated ZIP files always use
    # forward slashes as the directory separator, as required by the
    # ZIP format specification.
    if os.sep != "/" and os.sep in filename:
        filename = filename.replace(os.sep, "/")

On Windows, it then replaces every '\\' with '/'.

Consider a file named '\x97\x5c\x92\x9b', which is '予兆' in Japanese, encoded in SHIFT_JIS. You may have noticed the problem: '\x5c' is '\\' (backslash) in ASCII.

So ZipFile decodes the bytes as CP437 and then replaces every '\\' with '/'. The second byte of the Japanese character '予' is rewritten, so the character is no longer itself.

Some suggest replacing '/' with '\\' again and encoding the result as CP437 to get the raw bytes back. But what if both '/' ('\x2f') and '\\' ('\x5c') appear in the raw filename? Simply replacing '\\' in a byte stream without knowing the encoding is by no means a good approach.

Maybe we could provide a rawname field in the ZipInfo struct?

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue45981>
_______________________________________