New submission from John Goerzen <jgoer...@users.sourceforge.net>:

The zipfile.py standard library component handles non-UTF8 filenames in 
questionable ways in several places.  As the ZIP file format predated Unicode 
by a significant number of years, such filenames are actually fairly common in 
archives produced by older code.

Here is a very simple reproduction case. 

mkdir t
cd t
echo hi > `printf 'test\xf7.txt'`
cd ..
zip -9r t.zip t

0xf7 is the division sign in ISO-8859-1.  In the "t" directory, "ls | hd" 
displays:

00000000  74 65 73 74 f7 2e 74 78  74 0a                    |test..txt.|
0000000a


Now, here's a simple Python3 program:

import zipfile

z = zipfile.ZipFile("t.zip")
z.extractall()

If you run this on the relevant ZIP file, the single byte 0xf7 is replaced with 
the three-byte UTF-8 sequence e2 89 88 (U+2248, the character that 0xf7 maps to 
in cp437); "ls | hd" now displays:

00000000  74 65 73 74 e2 89 88 2e  74 78 74 0a              |test....txt.|
0000000c
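
The bytes in the listing above are consistent with the name having been decoded 
as cp437 and then written to disk as UTF-8.  A minimal sketch of that round 
trip, using only the built-in codecs (the variable names are illustrative, not 
taken from zipfile.py):

```python
# The filename bytes as stored in the archive.
raw = b"test\xf7.txt"

# Decoding as cp437 turns 0xf7 into U+2248 (ALMOST EQUAL TO).
decoded = raw.decode("cp437")

# Writing that string back out as UTF-8 yields three bytes for one.
on_disk = decoded.encode("utf-8")
print(on_disk)  # b'test\xe2\x89\x88.txt'

# For comparison, ISO-8859-1 maps 0xf7 to U+00F7 (DIVISION SIGN).
intended = raw.decode("iso-8859-1")
print(intended == "test\u00f7.txt")  # True
```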

The impact within Python programs is equally bad.  Fundamentally, the zipfile 
interface is broken: it should not decode filenames into strings, but should 
treat them as bytes and leave any decoding to applications.  Down various code 
paths, it appears to try decoding filenames as ascii, cp437, or utf-8.  However, 
the ZIP file format was also widely used on Unix systems, which tended to use 
iso-8859-* rather than cp437.  In short, zipfile.py cannot reliably guess the 
encoding of a filename in a ZIP file, so it is a data-loss bug that it tries 
and fails to do so.  It is a further bug that extractall mangles filenames; 
unzip(1) is perfectly capable of extracting these files correctly.  I'm 
attaching this zip file for reference.

At the very least, zipfile should provide a bytes interface for filenames for 
people who care about correctness.
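
Until such an interface exists, a possible workaround is sketched below.  It 
assumes that zipfile decoded the name as cp437 whenever the UTF-8 flag (bit 11, 
0x800) is clear, so that re-encoding as cp437 recovers the stored bytes.  The 
helper names original_name_bytes and extract_with_raw_names are hypothetical, 
the sketch targets Unix-like systems, and it performs no path sanitising 
(absolute paths, ".." components), so it should not be used on untrusted 
archives as written.

```python
import os
import shutil
import zipfile


def original_name_bytes(info):
    """Best-effort recovery of the filename bytes stored in the archive."""
    if info.flag_bits & 0x800:
        # Bit 11 set: the archive declares the name to be UTF-8.
        return info.filename.encode("utf-8")
    # Assumption: zipfile decoded the raw bytes as cp437, so encoding
    # with the same codec reverses the decoding.
    return info.filename.encode("cp437")


def extract_with_raw_names(zip_path, dest):
    """Extract all members, writing files under their original byte names."""
    dest_b = os.fsencode(dest)
    with zipfile.ZipFile(zip_path) as z:
        for info in z.infolist():
            target = os.path.join(dest_b, original_name_bytes(info))
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Stream the member contents to a path given as bytes.
            with z.open(info) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out)
```

With the attached archive, extract_with_raw_names("t.zip", "out") would be 
expected to create out/t/test\xf7.txt with the original single byte, provided 
the assumption about cp437 decoding holds for that archive.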

----------
files: t.zip
messages: 357023
nosy: jgoerzen
priority: normal
severity: normal
status: open
title: zipfile: Corrupts filenames containing non-UTF8 characters
type: behavior
Added file: https://bugs.python.org/file48724/t.zip

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue38861>
_______________________________________