New submission from STINNER Victor <victor.stin...@haypocalc.com>:

If the fullpath to the python3 binary contains a non-ASCII character and the 
file system encoding is ASCII, Python fails with:
---
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: can't initialize sys standard streams
ImportError: No module named encodings.utf_8
Abandon
---

The file system encoding is set to ASCII if there is no locale (eg. LANG=C).

The problem is that the command line argument, especially argv[0], is stored to 
a wchar_t* string using surrogates to store undecodable bytes.

Attached patch fixes calculate_path() and import functions to support 
surrogates. Details:

 * Initialize Py_FileSystemDefaultEncoding earlier in Py_InitializeEx(), 
because its value is required to encode unicode using surrogates to bytes
 * Rename char2wchar() to _Py_char2wchar(), the function is not more static ; 
and create function _Py_wchar2char()
 * Escape surrogates (reimplement surrogateescape decoder) in calculate_path() 
subfunctions (_wstat, _wgetcwd, _Py_wreadlink)
 * Use surrogateescape error handler in find_module(), NullImporter_init() and 
zipimporter_init()
 * Write a "fastpath" (I don't know the right term: is it an hack?) for utf-8 
encoding with surrogateescape error handler in PyUnicode_AsEncodedObject() and 
PyUnicode_AsEncodedString(): required because these functions are called by 
codecs module is initialized

The patch is a work in progress: there are some FIXME (I don't know if the 
string should be encoded/decoded using surrogates or not).

I only tested ASCII and UTF-8 file system encodings. I don't know if we can 
support more encodings. Python has few builtin encodings. Other encodings are 
implemented in Python: we have to import them, but we need the codec to import 
a module, so...

I don't think that Windows is affected by this issue because it has a better 
API for unicode filenames and command line arguments, and most patched 
functions are surrounded by #ifndef WINDOWS ... #endif

----------
components: Unicode
files: surrogates_bootstrap-4.patch
keywords: patch
messages: 101815
nosy: haypo
severity: normal
status: open
title: Support surrogates in import ; install Python in a non-ASCII directory
versions: Python 3.1, Python 3.2
Added file: http://bugs.python.org/file16671/surrogates_bootstrap-4.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue8242>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Reply via email to