Re: [PyGreSQL] Possible bug in PyGreSQL for Unicode and Python 2.x

Christoph Zwerschke Wed, 25 Apr 2018 03:42:23 -0700

On 25.04.2018 at 03:31, Murray Chapman wrote:
> I'm seeing some unexpected behavior with PyGreSQL running under Python
> 2.x when reading utf8-encoded data from the db.
>
> Basically, PyGreSQL/Python2.x always seems to return it as a str()
> rather than a unicode().

That's a feature, not a bug. There are actually no special "unicode"
columns in PostgreSQL. How strings are stored in PostgreSQL depends on
which server side character set is configured for the database:
https://www.postgresql.org/docs/current/static/multibyte.html

In PyGreSQL, strings have always been retrieved as native strings (str),
which are encoded byte strings in Python 2 and unicode strings nowadays
in Python 3. The encoding depends on the selected client character set.
There are several ways to change the client encoding (see link above).

The fact that PostgreSQL strings are converted to the Python str type
(and this is true for both Python 2 and 3) is documented here:
http://www.pygresql.org/contents/pg/adaptation.html#supported-data-types


> This means that under Python 2.x, raw bytes from the db column are
> handed through to the Python layer.

Right, but not really "raw" bytes because you have automatic character
set conversion depending on the selected server and client encoding.
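
To illustrate why the bytes are not really "raw": the same text yields
different byte strings depending on the encoding it is converted to. The
following is only a sketch in plain Python, without a database; the
sample word is an assumption chosen for illustration, and the two codecs
stand in for two possible client encodings.

```python
# The same text, as it might be stored in a utf8 database (sample value,
# not taken from the original report).
text = u"caf\u00e9"

# What a Python 2 client would see as a byte string under two different
# client encodings (simulated here with Python's own codecs).
as_utf8 = text.encode("utf-8")
as_latin1 = text.encode("latin-1")

print(repr(as_utf8))
print(repr(as_latin1))
```

The utf8 form needs two bytes for the accented letter, the latin1 form
only one, so code that decodes the result must know which client
encoding was in effect.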


> Here's proof that it's UTF-8 encoded:
>
>      postgres=> select array_agg(t) from (select
> ascii(regexp_split_to_table(unicode_column, '')) AS t from unicode_table
> where column_id='key') x;
>                          array_agg
>      --------------------------------------------------
>       {73,32,10084,32,72,117,99,107,97,98,101,101,115}
>      (1 row)

Right, the server side encoding is set to utf8 (the default nowadays),
so ascii() returns the unicode code point of the heart symbol (10084).
What you get in Python depends on the client encoding. If it is also set
to utf8, then you get the utf8-encoded string:


>      >>> sys.version
>      '2.7.8 (default, Mar 31 2018, 02:47:11) \n[GCC 4.1.2 20070626 (Red
> Hat 4.1.2-14)]'
>      >>> print db.query("select unicode_column from unicode_table where
> column_id='key'").getresult()
>      [('I \xe2\x9d\xa4 Huckabees',)]

This is indeed the utf8-encoded string.
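
As a cross-check, the code points from the psql output can be turned
back into text and encoded as utf8. This is a small sketch written for
Python 3 (under Python 2 you would need unichr() instead of chr()); the
variable names are just for illustration.

```python
# Code points reported by ascii() in the psql output quoted above.
code_points = [73, 32, 10084, 32, 72, 117, 99, 107, 97, 98, 101, 101, 115]

# Build the text from the code points, then encode it as UTF-8.
text = "".join(chr(cp) for cp in code_points)
encoded = text.encode("utf-8")

print(repr(encoded))  # b'I \xe2\x9d\xa4 Huckabees'
```

The three bytes \xe2\x9d\xa4 are the utf8 encoding of code point 10084,
which matches what getresult() returned.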

If you want to work with unicode strings in Python 2, you must manually
decode them or you can also use the workaround you mentioned to do this
automatically.
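
A minimal sketch of the manual decoding, assuming the client encoding is
utf8. The helper decode_rows() is hypothetical and not part of PyGreSQL;
it simply post-processes the list of tuples that getresult() returns.
The sample row is the one from your output, written as a bytes literal
so that the snippet runs on both Python 2.7 and Python 3.

```python
def decode_rows(rows, encoding="utf-8"):
    """Return a copy of rows with every byte string decoded to unicode.

    Hypothetical helper, not a PyGreSQL function. Non-string values
    (numbers, None, ...) are passed through unchanged.
    """
    return [
        tuple(
            value.decode(encoding) if isinstance(value, bytes) else value
            for value in row
        )
        for row in rows
    ]


# The row as returned under Python 2 in the quoted session.
rows = [(b"I \xe2\x9d\xa4 Huckabees",)]
decoded = decode_rows(rows)
print(repr(decoded))
```

The encoding passed to the helper has to match the client encoding of
the connection, otherwise the decode step fails or produces wrong text.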

Again, all of this has historic reasons. In Python 2 ("legacy Python"),
the native strings were byte strings and it was considered normal to
work with encoded strings. That's why PyGreSQL always returned byte
strings in Python 2 (it was the "expected thing" at that time), and we
can't change that for backward compatibility reasons. In Python 3 the
situation is different: strings are unicode by default now, which is why
PyGreSQL returns unicode in Python 3 as well. This is one of the
main reasons why it is highly recommended to switch to Python 3.


Hope this clarifies your issue.

-- Christoph
