STINNER Victor <victor.stin...@haypocalc.com> added the comment:

I created a tarball (.tar.gz) on Windows with Python 3.1 (which uses the "mbcs" 
encoding). With locale.getpreferredencoding() == 'cp1252', "é" (U+00e9) is 
encoded as 0xe9 (1 byte) and "à" (U+00e0) as 0xe0 (1 byte). WinRAR displays 
the file names correctly, but 7-zip displays the wrong glyphs.

So WinRAR expects CP1252 whereas 7-zip expects CP850.
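
The mismatch can be reproduced without either archiver. A minimal sketch, 
assuming Python's "cp1252" and "cp850" codecs match the tables the two tools 
use (the variable names are only for illustration):

```python
# Encode the file name characters the way the Windows ANSI code page does.
names = ["é", "à"]
encoded = [n.encode("cp1252") for n in names]

# Decode the same bytes with the ANSI (cp1252) and the OEM (cp850) code page.
as_cp1252 = [b.decode("cp1252") for b in encoded]
as_cp850 = [b.decode("cp850") for b in encoded]

# ascii() keeps the output printable whatever the console encoding is.
for name, raw, ansi, oem in zip(names, encoded, as_cp1252, as_cp850):
    print(ascii(name), raw, ascii(ansi), ascii(oem))
```

Decoding with cp1252 round-trips the names, while cp850 maps the same bytes 
to different characters.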

I also tested an archive encoded with UTF-8: WinRAR and 7-zip both display the 
wrong glyphs, because they decode the UTF-8 bytes as CP1252 / CP850 :-/
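
The UTF-8 case is the usual two-characters-for-one mojibake. A small sketch, 
again assuming the archivers behave like Python's "cp1252" and "cp850" codecs:

```python
# "é" takes two bytes in UTF-8.
utf8_name = "é".encode("utf-8")

# A tool that assumes a single-byte code page sees two characters instead.
seen_as_cp1252 = utf8_name.decode("cp1252")
seen_as_cp850 = utf8_name.decode("cp850")

print(utf8_name, ascii(seen_as_cp1252), ascii(seen_as_cp850))
```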

If an archive will be used on UNIX, I think that the archive should use UTF-8 
(on Windows and UNIX). But if the archive is read on Windows with WinRAR or 
7-zip, the archive should use a codepage.

Since mbcs looks to be the least bad choice, it could be used, but with the 
"replace" error handler (because mbcs doesn't support the "surrogateescape" 
error handler).
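
To illustrate what the encoding and error handler change in the archive, here 
is a rough sketch using the encoding and errors arguments of tarfile.open(). 
It makes two assumptions: "cp1252" stands in for "mbcs" (which only exists on 
Windows), and GNU_FORMAT is forced so that the name is stored in the chosen 
encoding rather than in a pax header. make_tar() and header_name() are 
hypothetical helpers, not part of tarfile.

```python
import io
import tarfile

def make_tar(name, encoding, errors="replace"):
    """Return the bytes of a one-member tar archive (hypothetical helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                      encoding=encoding, errors=errors) as tar:
        data = b"hello"
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def header_name(raw):
    """Raw bytes of the name field: the first 100 bytes of the header."""
    return raw[:100].rstrip(b"\0")

# Same file name, two encodings: one byte versus two bytes for "é".
codepage_name = header_name(make_tar("é.txt", "cp1252"))
utf8_name = header_name(make_tar("é.txt", "utf-8"))

# A character outside cp1252 is replaced by "?" instead of raising an error.
replaced_name = header_name(make_tar("Ω.txt", "cp1252"))

print(codepage_name, utf8_name, replaced_name)
```

The "replace" handler avoids an exception on names the code page cannot 
represent, at the cost of losing the original character.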

--

About the code pages:

 - chcp command displays "Active code page: 850"
 - python -c "import locale; print(locale.getpreferredencoding())" displays 
"cp1252"
 - python -c "import sys; print(sys.stdout.encoding)" displays "cp850"

Python calls GetConsoleOutputCP() to get the stdout/stderr encoding (the 
console code page), whereas locale.getpreferredencoding() 
(_locale._getdefaultlocale()) calls GetACP() (the ANSI code page).
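
The two code pages can be compared from Python as well. A sketch that calls 
the Win32 functions through ctypes when running on Windows and falls back to 
the portable values elsewhere; describe_encodings() is a hypothetical helper, 
and newer Python versions may report a different stdout encoding on a Windows 
console than Python 3.1 does:

```python
import locale
import sys

def describe_encodings():
    """Collect the encodings discussed above (hypothetical helper)."""
    info = {
        "preferred": locale.getpreferredencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }
    if sys.platform == "win32":
        import ctypes
        kernel32 = ctypes.windll.kernel32
        # ANSI code page, the source of locale.getpreferredencoding().
        info["GetACP"] = kernel32.GetACP()
        # Console output code page, what "chcp" reports.
        info["GetConsoleOutputCP"] = kernel32.GetConsoleOutputCP()
    return info

for key, value in describe_encodings().items():
    print(key, value)
```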

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8784>
_______________________________________