New submission from John Goerzen <[email protected]>:
The zipfile.py standard library component handles non-UTF-8 filenames in
several questionable ways. As the ZIP file format predated Unicode by a
significant number of years, such filenames are actually fairly common in
older archives.
Here is a very simple reproduction case.
mkdir t
cd t
echo hi > `printf 'test\xf7.txt'`
cd ..
zip -9r t.zip t
0xf7 is the division sign in ISO-8859-1. In the "t" directory, "ls | hd"
displays:
00000000 74 65 73 74 f7 2e 74 78 74 0a |test..txt.|
0000000a
Now, here's a simple Python3 program:
import zipfile
z = zipfile.ZipFile("t.zip")
z.extractall()
If you run this on the relevant ZIP file, the 0xf7 byte is replaced with a
three-byte UTF-8 sequence; "ls | hd" now displays:
00000000 74 65 73 74 e2 89 88 2e 74 78 74 0a |test....txt.|
0000000c
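The e2 89 88 bytes are consistent with the name having been decoded as cp437 and then written back out as UTF-8. A minimal sketch using only the built-in codecs (the variable names are illustrative, not anything from zipfile.py):

```python
# The filename bytes exactly as stored in the archive.
raw = b"test\xf7.txt"

# What the author intended: 0xf7 is the division sign in ISO-8859-1.
as_latin1 = raw.decode("iso-8859-1")   # 'test\u00f7.txt'

# What a cp437 decode produces: 0xf7 is U+2248 (ALMOST EQUAL TO) in cp437.
as_cp437 = raw.decode("cp437")         # 'test\u2248.txt'

# Writing the cp437-decoded name to a UTF-8 filesystem yields the bytes
# seen in the second hex dump (without the trailing newline from ls).
print(as_cp437.encode("utf-8").hex(" "))
# 74 65 73 74 e2 89 88 2e 74 78 74
```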
The impact within Python programs is equally bad. Fundamentally, the zipfile
interface is broken; it should not try to decode filenames into strings and
should instead treat them as bytes and leave potential decoding up to
applications. It appears to try, down various code paths, to decode filenames
as ascii, cp437, or utf-8. However, the ZIP file format was often used on Unix
systems as well, which didn't tend to use cp437 (iso-8859-* was more common).
In short, there is no way that zipfile.py can reliably guess the encoding of a
filename in a ZIP file, so it is a data-loss bug that it attempts and fails to
do so. It is a further bug that extractall mangles filenames; unzip(1) is
perfectly capable of extracting these files correctly. I'm attaching this zip
file for reference.
At the very least, zipfile should provide a bytes interface for filenames for
people who care about correctness.
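Until such an interface exists, an application can try to undo the decoding itself. The following is only a sketch: raw_filename and raw_names are hypothetical helpers, not part of zipfile, and they assume that zipfile decoded each name as UTF-8 when general-purpose flag bit 11 (0x800) is set and as cp437 otherwise. Python's cp437 codec maps all 256 byte values, so re-encoding should be lossless under that assumption.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8


def raw_filename(info):
    """Best-effort recovery of a member name as bytes (hypothetical helper).

    Assumes zipfile decoded the name as UTF-8 when the flag is set and as
    cp437 otherwise, and that no other normalization changed the name.
    """
    if info.flag_bits & UTF8_FLAG:
        return info.filename.encode("utf-8")
    return info.filename.encode("cp437")


def raw_names(file):
    """Return the recovered name bytes for every member of a ZIP file."""
    with zipfile.ZipFile(file) as z:
        return [raw_filename(info) for info in z.infolist()]
```

For the attached t.zip, raw_names("t.zip") should include the original test\xf7.txt bytes, which the application can then decode with whatever encoding it believes is right. On Python 3.11 and later, ZipFile also accepts a metadata_encoding argument for reading member names, which may be a simpler option where it is available.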
----------
files: t.zip
messages: 357023
nosy: jgoerzen
priority: normal
severity: normal
status: open
title: zipfile: Corrupts filenames containing non-UTF8 characters
type: behavior
Added file: https://bugs.python.org/file48724/t.zip
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue38861>
_______________________________________