Toshio Kuratomi added the comment:

I found some "standards" docs that could bear on this:

http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Appendix D:
"D.1 The ZIP format has historically supported only the original IBM PC 
character encoding set, commonly referred to as IBM Code Page 437."
[..]
"D.2 If general purpose bit 11 is unset, the file name and comment should 
conform to the original ZIP character encoding.  If general purpose bit 11 is 
set, the filename and comment must support The Unicode Standard, Version 4.1.0 
or greater using the character encoding form defined by the UTF-8 storage 
specification."
[..]

So there are two choices for a filename in a zipfile:

* bytes that make valid UTF-8 strings
* bytes that make valid strings in code page 437
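
As a rough illustration of the D.2 rule quoted above, here is a minimal 
sketch of how a reader might pick between the two encodings.  The names 
decode_zip_name and UTF8_FLAG are made up for this example and are not part 
of any library; the only thing taken from the spec is that bit 11 selects 
UTF-8.

```python
# Hypothetical helper for illustration; not part of the zipfile module.
UTF8_FLAG = 0x800  # general purpose bit 11, i.e. 1 << 11


def decode_zip_name(raw: bytes, flag_bits: int) -> str:
    """Decode a stored filename the way APPNOTE D.2 describes."""
    if flag_bits & UTF8_FLAG:
        # Bit 11 set: the name is supposed to be UTF-8.
        return raw.decode("utf-8")
    # Bit 11 unset: fall back to the original ZIP encoding, CP437.
    return raw.decode("cp437")
```

Note that the same bytes give different names depending on the flag: 
b"\xc3\xa9" is one character with bit 11 set and two with it unset.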

http://en.wikipedia.org/wiki/Code_page_437#Standard_code_page

Code Page 437 takes up all 256 possible bit patterns available in a byte.
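
This is easy to check, at least against Python's built-in cp437 codec 
(which I am assuming is a fair stand-in for the code page, even though it 
maps bytes 0x00-0x1F to control characters rather than the PC graphics 
glyphs).  A quick sketch:

```python
# Every one of the 256 byte values decodes under cp437 ...
all_bytes = bytes(range(256))
decoded = all_bytes.decode("cp437")

# ... to 256 distinct characters, and the mapping round-trips.
assert len(set(decoded)) == 256
assert decoded.encode("cp437") == all_bytes

# The same byte string is not valid UTF-8.
try:
    all_bytes.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```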

These two factors mean that, viewed as a sequence of bytes, a filename in a 
zipfile can (according to the zipfile standard) contain any possible sequence 
of bytes.  Viewed as a sequence of human characters, it can contain any 
possible sequence of Unicode code points encoded as UTF-8.

The tricky bit: if the bytes are not valid UTF-8 then officially the characters 
should be limited to the 256 characters of Code Page 437.  However, the client 
tools I've looked at exploit the fact that all bytes are possible and simply 
save the raw bytes that make up the filename into the zip file.
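
One consequence: because the cp437 decoding is lossless, a name that a 
reader decoded as cp437 can be turned back into the original bytes and 
decoded again with whatever encoding the writing tool really used.  A 
minimal sketch, assuming the reader decoded with cp437; redecode_name is a 
hypothetical helper, and cp866 is just an arbitrary example of a "wrong" 
encoding a tool might have written.

```python
def redecode_name(name_as_cp437: str, actual_encoding: str) -> str:
    """Undo a cp437 decode and re-decode the raw bytes (hypothetical helper)."""
    raw = name_as_cp437.encode("cp437")  # lossless: recovers the stored bytes
    return raw.decode(actual_encoding)


# Simulate a tool that stored a cp866 name without setting bit 11.
original = "файл.txt"
stored = original.encode("cp866")
seen_by_reader = stored.decode("cp437")  # mojibake, but no information lost
recovered = redecode_name(seen_by_reader, "cp866")
```

The catch is that the zip file itself gives no hint about actual_encoding; 
the caller has to know or guess it.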

----------
nosy: +a.badger

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue16310>
_______________________________________