[Python-Dev] Unicode -- UTF-8 in CPython extension modules
I've uncovered what seems to me to be a problem with Python Unicode string objects passed to extension modules. Or perhaps it's revealing a misunderstanding on my part :-) So I would like to get some clarification.

Extension modules written in C receive strings from Python via the PyArg_ParseTuple family. Most extension modules use the 's' or 's#' format parameter. Many C libraries in Linux use the UTF-8 encoding. The 's' format, when passed a Unicode object, will encode the string according to the default encoding, which is immutably set to 'ascii' in site.py. Thus a C library expecting UTF-8 whose binding uses the 's' format in PyArg_ParseTuple will get an encoding error when passed a Unicode string which contains any code points outside the ASCII range.

Now my questions:

* Is the use of the 's' or 's#' format parameter in an extension binding
  expecting UTF-8 fundamentally broken and not expected to work? Instead,
  should the binding be using a format conversion which specifies the
  desired encoding, e.g. 'es' or 'es#'?

* The extension modules could successfully use the 's' or 's#' format
  conversion in a UTF-8 environment if the default encoding was UTF-8.
  Changing the default encoding to UTF-8 would in one easy stroke fix
  most extension modules, right? Why is the default encoding 'ascii' in
  UTF-8 environments, and why is the default encoding prohibited from
  being changed from 'ascii'?

* Did Python 2.5 introduce anything which now makes this issue visible,
  whereas before it was masked by some other behavior?

Summary: Python programs which use Unicode string objects for their i18n, and which link to C libraries expecting UTF-8 through a CPython binding that only uses the 's' or 's#' formats, seem to often fail with encoding errors. However, I have yet to see a CPython binding which explicitly defines its encoding requirements.
This suggests to me that either I do not understand the issue in its entirety, or many CPython bindings in Linux UTF-8 environments are broken with respect to their i18n handling and the problem is currently not addressed.

-- 
John Dennis [EMAIL PROTECTED]

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
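The failure described above can be reproduced at the Python level, since the 's' converter's implicit conversion is essentially an encode with the default codec. A minimal sketch (shown in modern Python syntax; under Python 2.5 the default codec in question is 'ascii'):

```python
# The 's' converter implicitly encodes a Unicode argument with the
# default codec; with 'ascii' that fails for any non-ASCII code point.
text = u"caf\u00e9"            # u'café' -- one code point outside ASCII

try:
    text.encode("ascii")       # what 's' effectively attempts
except UnicodeEncodeError as exc:
    print("default-encoding failure: %s" % exc)

print(text.encode("utf-8"))    # the bytes a UTF-8 C library actually wants
```

The explicit `encode("utf-8")` call at the end is exactly the conversion a correct binding has to perform somewhere, whether in C via an encoding-aware format unit or in Python before crossing into C.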
Re: [Python-Dev] Unicode -- UTF-8 in CPython extension modules
> I've uncovered what seems to me to be a problem with Python Unicode
> string objects passed to extension modules. Or perhaps it's revealing
> a misunderstanding on my part :-) So I would like to get some
> clarification.

It seems to me that there is indeed one or more misunderstandings on your part. Please discuss them on comp.lang.python.

> Extension modules written in C receive strings from Python via the
> PyArg_ParseTuple family. Most extension modules use the 's' or 's#'
> format parameter. Many C libraries in Linux use the UTF-8 encoding.
> The 's' format, when passed a Unicode object, will encode the string
> according to the default encoding, which is immutably set to 'ascii'
> in site.py. Thus a C library expecting UTF-8 which uses the 's' format
> in PyArg_ParseTuple will get an encoding error when passed a Unicode
> string which contains any code points outside the ASCII range.

The C library isn't using the 's' format; a Python extension module wrapping the C library is. So whatever conversion is necessary should be done by that extension module.

> Now my questions:
>
> * Is the use of the 's' or 's#' format parameter in an extension
>   binding expecting UTF-8 fundamentally broken and not expected to
>   work? Instead, should the binding be using a format conversion which
>   specifies the desired encoding, e.g. 'es' or 'es#'?

Yes. Alternatively, require the callers to pass UTF-8 byte strings, not Unicode strings.

> * The extension modules could successfully use the 's' or 's#' format
>   conversion in a UTF-8 environment if the default encoding was UTF-8.
>   Changing the default encoding to UTF-8 would in one easy stroke fix
>   most extension modules, right?

Wrong. This assumes that most libraries do indeed specify their APIs in terms of UTF-8. I don't think that is a fact; not in the world of 2008.

> Why is the default encoding 'ascii' in UTF-8 environments, and why is
> the default encoding prohibited from being changed from 'ascii'?

There are several reasons, all off-topic for python-dev.
ASCII was considered the safest assumption: when converting between byte and Unicode strings in the absence of an encoding specification, you can't assume anything but ASCII (technically, not even that, as the bytes may be EBCDIC, but ASCII is safe for the majority of systems -- unlike UTF-8). The encoding can't be changed because that would break hash().

> * Did Python 2.5 introduce anything which now makes this issue visible,
>   whereas before it was masked by some other behavior?

I don't know. Can you please be a bit more specific (on comp.lang.python) about where you suspect such a change?

Regards,
Martin
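The 'es'/'es#' approach Martin endorses can be sketched in C. In the fragment below the function being defined and the C library call are hypothetical names for illustration; only PyArg_ParseTuple's 'es#' format unit, PyMem_Free, and PY_SSIZE_T_CLEAN are real CPython API:

```c
/* Hypothetical binding sketch: convert the argument to UTF-8 explicitly
 * with the 'es#' converter instead of relying on 's#' and the default
 * encoding.  Intended to be compiled as part of a CPython extension. */
#define PY_SSIZE_T_CLEAN   /* make '#' lengths Py_ssize_t (Python 2.5+) */
#include <Python.h>

static PyObject *
wrap_library_call(PyObject *self, PyObject *args)
{
    char *utf8 = NULL;     /* NULL asks 'es#' to allocate a new buffer */
    Py_ssize_t len;

    /* 'es#' takes the encoding name first, then char** and a length. */
    if (!PyArg_ParseTuple(args, "es#", "utf-8", &utf8, &len))
        return NULL;

    some_utf8_c_library_call(utf8, len);   /* hypothetical UTF-8 API */

    PyMem_Free(utf8);      /* buffers allocated by 'es#' must be freed */
    Py_RETURN_NONE;
}
```

Note the ownership rule: when the char pointer is passed in as NULL, 'es#' allocates the encoded buffer, and the binding is responsible for releasing it with PyMem_Free once the C call returns.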
Re: [Python-Dev] Unicode -- UTF-8 in CPython extension modules
On 2008-02-23 00:46, Colin Walters wrote:
> On Fri, Feb 22, 2008 at 4:23 PM, John Dennis [EMAIL PROTECTED] wrote:
>> Python programs which use Unicode string objects for their i18n, and
>> which link to C libraries expecting UTF-8 through a CPython binding
>> that only uses the 's' or 's#' formats, seem to often fail with
>> encoding errors.
>
> One thing to be aware of is that PyGTK+ actually sets the Python
> Unicode object encoding to UTF-8.
> http://bugzilla.gnome.org/show_bug.cgi?id=132040
>
> I mention this because PyGTK is a very popular library related to
> Python and Linux. So currently if you import gtk, then libraries which
> are using UTF-8 (as you say, the vast majority) will work with Python
> unicode objects unmodified.

Are you suggesting that John should rely on a bug in some 3rd-party extension instead of fixing the Python extension to use 'es#' where needed?

There's a good reason why we don't allow setting the default encoding outside site.py. Trying to play tricks to change the default encoding later on will only cause problems, e.g. the cached default-encoded versions of Unicode objects will then use different encodings -- the one set in site.py and, later, the ones with the new encoding. As a result, all kinds of weird things can happen.

Using the Python Unicode C API really isn't all that hard, and it's well documented too, so please use it instead of trying to design software based on workarounds.

Thanks,

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 23 2008)
Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

Try mxODBC.Zope.DA for Windows, Linux, Solaris, MacOSX for free!

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str. 48
D-40764 Langenfeld, Germany. CEO Dipl.-Math.
Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
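The same "do the conversion deliberately" advice can also be applied at the Python layer of a binding, as Martin suggested earlier in the thread (require byte strings at the C boundary). A minimal sketch, with the wrapper name and the underlying binding invented for illustration:

```python
def call_c_library(text):
    """Hypothetical wrapper: accept text or bytes, and always hand
    UTF-8-encoded bytes to the underlying C binding."""
    if isinstance(text, str):          # a Unicode string
        data = text.encode("utf-8")    # explicit, not the default codec
    elif isinstance(text, bytes):
        data = text                    # assume caller already sent UTF-8
    else:
        raise TypeError("expected str or bytes, got %r" % type(text))
    return data   # stand-in for the real C call, e.g. _binding.call(data)
```

With this shape, the C side can keep using a plain byte-oriented format unit, because no Unicode object ever reaches PyArg_ParseTuple and the implicit default-encoding conversion never happens.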
Re: [Python-Dev] Unicode -- UTF-8 in CPython extension modules
Colin Walters wrote:
> On Fri, Feb 22, 2008 at 4:23 PM, John Dennis [EMAIL PROTECTED] wrote:
>> Python programs which use Unicode string objects for their i18n, and
>> which link to C libraries expecting UTF-8 through a CPython binding
>> that only uses the 's' or 's#' formats, seem to often fail with
>> encoding errors.
>
> One thing to be aware of is that PyGTK+ actually sets the Python
> Unicode object encoding to UTF-8.
> http://bugzilla.gnome.org/show_bug.cgi?id=132040
>
> I mention this because PyGTK is a very popular library related to
> Python and Linux. So currently if you import gtk, then libraries which
> are using UTF-8 (as you say, the vast majority) will work with Python
> unicode objects unmodified.

Thank you, Colin; your input was very helpful. The fact that PyGTK's i18n handling worked was the counterexample which made me doubt my analysis was correct, but I can see from the GNOME bug report and Martin's subsequent comment that the analysis was sound. It had perplexed me enormously why i18n handling worked in some circumstances but failed in others. Apparently it was a side effect of importing gtk, a problem exacerbated when either the sequence of imports or the complete set of imports was not taken into account.

I am aware of other Python bindings (libxml2 is one example) which share the same mistake of not using the 'es' family of format conversions when the underlying library is UTF-8. At least I now understand why incorrectly coded bindings in some circumstances produced correct results when logic dictated they shouldn't.

-- 
John Dennis [EMAIL PROTECTED]