[Python-Dev] Unicode -- UTF-8 in CPython extension modules

2008-02-22 Thread John Dennis
I've uncovered what seems to me to be a problem with Python Unicode
string objects passed to extension modules. Or perhaps it's revealing
a misunderstanding on my part :-) So I would like to get some
clarification.

Extension modules written in C receive strings from Python via the
PyArg_ParseTuple family. Most extension modules use the 's' or 's#'
format parameter.

Many C libraries in Linux use the UTF-8 encoding.

The 's' format, when passed a Unicode object, will encode the string
according to the default encoding, which is immutably set to 'ascii' in
site.py. Thus a C library expecting UTF-8 which uses the 's' format in
PyArg_ParseTuple will get an encoding error when passed a Unicode
string which contains any code points outside the ASCII range.
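The failure can be reproduced in miniature by performing the two encodings explicitly (a sketch in Python 3 syntax; in Python 2 the 's' format performed the ascii step implicitly):

```python
# Explicit model of the conversion the 's' format performs implicitly
# (Python 2 semantics): the Unicode string is encoded with the default
# codec before the C level ever sees it.
text = u"caf\u00e9"  # contains one code point outside the ASCII range

print(repr(text.encode("utf-8")))  # what a UTF-8 C library wants

try:
    text.encode("ascii")  # what the immutable 'ascii' default attempts
except UnicodeEncodeError as exc:
    print("encoding error:", exc.reason)
```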

Now my questions:

* Is the use of the 's' or 's#' format parameter in an extension
   binding expecting UTF-8 fundamentally broken and not expected to
   work?  Instead should the binding be using a format conversion which
   specifies the desired encoding, e.g. 'es' or 'es#'?

* The extension modules could successfully use the 's' or 's#' format
   conversion in a UTF-8 environment if the default encoding was
   UTF-8. Changing the default encoding to UTF-8 would in one easy
   stroke fix most extension modules, right? Why is the default
   encoding 'ascii' in UTF-8 environments and why is the default
   encoding prohibited from being changed from ascii?

* Did Python 2.5 introduce anything which now makes this issue visible
   whereas before it was masked by some other behavior?
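A Python-level model of what an encoding-aware conversion such as 'es' with "utf-8" amounts to at the boundary (the helper name here is illustrative, not an actual API; the real work happens inside PyArg_ParseTuple):

```python
def to_utf8(arg):
    """Rough Python-level model (hypothetical helper, not C API) of an
    'es' conversion with the encoding fixed to "utf-8": Unicode input
    is encoded explicitly, so the result never depends on the process
    default codec."""
    if isinstance(arg, str):    # a Unicode string, in Python 3 terms
        return arg.encode("utf-8")
    if isinstance(arg, bytes):  # an already-encoded byte string
        return arg
    raise TypeError("expected a text or byte string")

print(to_utf8(u"\u00fcber"))  # b'\xc3\xbcber'
```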

Summary:

Python programs which use Unicode string objects for their i18n and
which link to C libraries expecting UTF-8, but through a CPython
binding which only uses the 's' or 's#' formats, seem to often
fail with encoding errors. However, I have yet to see a CPython
binding which explicitly defines its encoding requirements. This
suggests to me that either I do not understand the issue in its
entirety, or many CPython bindings in Linux UTF-8 environments are
broken with respect to their i18n handling and the problem is
currently not addressed.

-- 
John Dennis [EMAIL PROTECTED]
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Unicode -- UTF-8 in CPython extension modules

2008-02-22 Thread Martin v. Löwis
> I've uncovered what seems to me to be a problem with Python Unicode
> string objects passed to extension modules. Or perhaps it's revealing
> a misunderstanding on my part :-) So I would like to get some
> clarification.

It seems to me that there is indeed one or more misunderstandings
on your part. Please discuss them on comp.lang.python.

> Extension modules written in C receive strings from Python via the
> PyArg_ParseTuple family. Most extension modules use the 's' or 's#'
> format parameter.
>
> Many C libraries in Linux use the UTF-8 encoding.
>
> The 's' format, when passed a Unicode object, will encode the string
> according to the default encoding, which is immutably set to 'ascii' in
> site.py. Thus a C library expecting UTF-8 which uses the 's' format in
> PyArg_ParseTuple will get an encoding error when passed a Unicode
> string which contains any code points outside the ASCII range.

The C library isn't expecting or using the 's' format. A Python module
wrapping the C library is. So whatever conversion is necessary should
be done by that Python module.

> Now my questions:
>
> * Is the use of the 's' or 's#' format parameter in an extension
>    binding expecting UTF-8 fundamentally broken and not expected to
>    work?  Instead should the binding be using a format conversion which
>    specifies the desired encoding, e.g. 'es' or 'es#'?

Yes. Alternatively, require the callers to pass UTF-8 byte strings,
not Unicode strings.
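The second alternative, sketched with a stand-in function (fake_c_binding is hypothetical, standing in for any extension function): the caller encodes to UTF-8 itself, so the binding's 's' format only ever sees byte strings and the default codec is never involved.

```python
def fake_c_binding(arg):
    # Stands in for an extension function using the 's' format; in
    # Python 2 it would have rejected non-ASCII unicode objects, but
    # byte strings pass through untouched.
    assert isinstance(arg, bytes)
    return len(arg)

n = fake_c_binding(u"\u00e9l\u00e8ve".encode("utf-8"))
print(n)  # 7: five characters, two of them two bytes each in UTF-8
```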

> * The extension modules could successfully use the 's' or 's#' format
>    conversion in a UTF-8 environment if the default encoding was
>    UTF-8. Changing the default encoding to UTF-8 would in one easy
>    stroke fix most extension modules, right?

Wrong. This assumes that most libraries do indeed specify their
APIs in terms of UTF-8. I don't think that is a fact; not in the world
of 2008.

> Why is the default
>    encoding 'ascii' in UTF-8 environments and why is the default
>    encoding prohibited from being changed from ascii?

There are several reasons, all off-topic for python-dev.
ASCII was considered the safest assumption: when
converting between byte and Unicode strings in the absence of an
encoding specification, you can't assume anything but ASCII
(technically, not even that, as the bytes may be EBCDIC, but ASCII
is safe for the majority of systems - unlike UTF-8).
The encoding can't be changed because that would break hash().

> * Did Python 2.5 introduce anything which now makes this issue visible
>    whereas before it was masked by some other behavior?

I don't know. Can you please be a bit more specific (on
comp.lang.python) about where you suspect such a change?

Regards,
Martin


Re: [Python-Dev] Unicode -- UTF-8 in CPython extension modules

2008-02-22 Thread M.-A. Lemburg
On 2008-02-23 00:46, Colin Walters wrote:
> On Fri, Feb 22, 2008 at 4:23 PM, John Dennis [EMAIL PROTECTED] wrote:
>
>> Python programs which use Unicode string objects for their i18n and
>> which link to C libraries expecting UTF-8, but through a CPython
>> binding which only uses the 's' or 's#' formats, seem to often
>> fail with encoding errors.
>
> One thing to be aware of is that PyGTK+ actually sets the Python
> Unicode object encoding to UTF-8.
>
> http://bugzilla.gnome.org/show_bug.cgi?id=132040
>
> I mention this because PyGTK is a very popular library related to
> Python and Linux.  So currently if you import gtk, then libraries
> which are using UTF-8 (as you say, the vast majority) will work with
> Python unicode objects unmodified.

Are you suggesting that John should rely on a bug in some 3rd-party
extension instead of fixing the Python extension to use 'es#' where
needed?

There's a good reason why we don't allow setting the default
encoding outside site.py.

Trying to play tricks to change the default encoding later on
will only cause problems, e.g. the cached default-encoded versions
of Unicode objects will then use different encodings - the one set
in site.py and later the ones with the new encoding. As a result,
all kinds of weird things can happen.

Using the Python Unicode C API really isn't all that hard, and it's
well documented too, so please use it instead of trying to design
software based on workarounds.
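As a sketch of what using the codec machinery explicitly looks like from Python (the Python-level analogue of C API calls such as PyUnicode_AsUTF8String): encode at the boundary and never consult a process-global default.

```python
import codecs

# Fetch the UTF-8 encoder explicitly; the default encoding is never
# consulted, so the result is the same in every environment.
encode_utf8 = codecs.getencoder("utf-8")

# The encoder returns the bytes plus the number of characters consumed.
data, consumed = encode_utf8(u"Gr\u00fc\u00dfe")
print(repr(data), consumed)  # b'Gr\xc3\xbc\xc3\x9fe' 5
```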

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Feb 23 2008)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


 Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611


Re: [Python-Dev] Unicode -- UTF-8 in CPython extension modules

2008-02-22 Thread John Dennis
Colin Walters wrote:
> On Fri, Feb 22, 2008 at 4:23 PM, John Dennis [EMAIL PROTECTED] wrote:
>
>> Python programs which use Unicode string objects for their i18n and
>> which link to C libraries expecting UTF-8, but through a CPython
>> binding which only uses the 's' or 's#' formats, seem to often
>> fail with encoding errors.
>
> One thing to be aware of is that PyGTK+ actually sets the Python
> Unicode object encoding to UTF-8.
>
> http://bugzilla.gnome.org/show_bug.cgi?id=132040
>
> I mention this because PyGTK is a very popular library related to
> Python and Linux.  So currently if you import gtk, then libraries
> which are using UTF-8 (as you say, the vast majority) will work with
> Python unicode objects unmodified.

Thank you Colin, your input was very helpful. The fact that PyGTK's i18n
handling worked was the counterexample which made me doubt my analysis
was correct, but I can see from the GNOME bug report and Martin's
subsequent comment that the analysis was sound. It had perplexed me
enormously why i18n handling worked in some circumstances but failed in
others. Apparently it was a side effect of importing gtk, a problem
exacerbated when either the sequence of imports or the complete set of
imports was not taken into account.

I am aware of other Python bindings (libxml2 is one example) which share
the same mistake of not using the 'es' family of format conversions when
the underlying library expects UTF-8. At least I now understand why
incorrectly coded bindings in some circumstances produced correct
results when logic dictated they shouldn't.

-- 
John Dennis [EMAIL PROTECTED]