On 25.04.2018 at 03:31, Murray Chapman wrote:
> I'm seeing some unexpected behavior with PyGreSQL running under Python
> 2.x when reading utf8-encoded data from the db.
>
> Basically, PyGreSQL/Python2.x always seems to return it as a str()
> rather than a unicode().

That's a feature, not a bug. PostgreSQL actually has no special "unicode" columns. How strings are stored in PostgreSQL depends on which server-side character set is configured for the database: https://www.postgresql.org/docs/current/static/multibyte.html

In PyGreSQL, strings have always been returned as native strings (str): these are encoded byte strings in Python 2 and unicode strings in Python 3. In Python 2, the encoding of the returned byte strings depends on the selected client character set. There are several ways to change the client encoding (see link above).

The fact that PostgreSQL strings are converted to the Python str type (and this is true for both Python 2 and 3) is documented here: http://www.pygresql.org/contents/pg/adaptation.html#supported-data-types

> This means that under Python 2.x, raw bytes from the db column are
> handed through to the Python layer.

Right, but not really "raw" bytes because you have automatic character set conversion depending on the selected server and client encoding.
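To illustrate what "depending on the client encoding" means for the bytes you receive, here is a small sketch that needs no database. It only mimics the conversion with Python's own codecs; the sample text and the codec names 'utf-8' and 'latin-1' (standing in for the PostgreSQL client encodings UTF8 and LATIN1) are my assumptions for the example, not something taken from your setup.

```python
# -*- coding: utf-8 -*-
# Illustration only: the same text yields different byte strings
# depending on which encoding it is converted to before delivery.
text = u'caf\xe9'  # hypothetical column value, "café"

# What a Python 2 client would roughly see with client_encoding = UTF8
as_utf8 = text.encode('utf-8')

# What it would roughly see with client_encoding = LATIN1
as_latin1 = text.encode('latin-1')

print(repr(as_utf8))
print(repr(as_latin1))
```

So the byte string is not a raw dump of the stored data; it is the text converted to whatever the client asked for.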

> Here's proof that it's UTF-8 encoded:
>
>      postgres=> select array_agg(t) from (select
> ascii(regexp_split_to_table(unicode_column, '')) AS t from unicode_table
> where column_id='key') x;
>                          array_agg
>      --------------------------------------------------
>       {73,32,10084,32,72,117,99,107,97,98,101,101,115}
>      (1 row)

Right, the server-side encoding is set to utf8 (the default nowadays), so ascii() returns the Unicode code point of the heart symbol (10084). What you get in Python depends on the client encoding. If that is also set to utf8, you get the utf8-encoded string:

>      >>> sys.version
>      '2.7.8 (default, Mar 31 2018, 02:47:11) \n[GCC 4.1.2 20070626 (Red
> Hat 4.1.2-14)]'
>      >>> print db.query("select unicode_column from unicode_table where
> column_id='key'").getresult()
>      [('I \xe2\x9d\xa4 Huckabees',)]

This is indeed the utf8-encoded string.
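You can check the correspondence between the two outputs yourself. The following sketch takes the code points from your array_agg result, builds the text from them, and encodes it as UTF-8. The unichr/chr shim is only there so that it runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Code points as reported by ascii() in the psql session quoted above.
code_points = [73, 32, 10084, 32, 72, 117, 99, 107, 97, 98, 101, 101, 115]

try:
    unichr  # exists on Python 2 only
except NameError:
    unichr = chr  # Python 3: chr() already handles all code points

# Build the unicode text, then encode it the way a utf8 client receives it.
text = u''.join(unichr(cp) for cp in code_points)
encoded = text.encode('utf-8')

print(repr(encoded))
```

The single code point 10084 (U+2764) becomes the three bytes \xe2\x9d\xa4, which is exactly what getresult() gave you.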

If you want to work with unicode strings in Python 2, you must either decode them manually or use the workaround you mentioned to do this automatically.
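Manual decoding could look roughly like the sketch below. Note that decode_rows is a hypothetical helper I made up for this mail, not part of PyGreSQL, and it assumes the client encoding is utf8. The sample row is a literal copy of your getresult() output, so the snippet runs without a database.

```python
# -*- coding: utf-8 -*-
def decode_rows(rows, encoding='utf-8'):
    """Return a copy of rows with every byte string decoded to unicode.

    Hypothetical helper; 'encoding' must match the client encoding.
    Non-string values (numbers, None, ...) are passed through unchanged.
    """
    return [
        tuple(value.decode(encoding) if isinstance(value, bytes) else value
              for value in row)
        for row in rows
    ]

# Literal copy of the getresult() output quoted above (bytes is str on Python 2).
rows = [(b'I \xe2\x9d\xa4 Huckabees',)]
decoded = decode_rows(rows)

print(repr(decoded))
```

In real code you would pass the list returned by getresult() instead of the literal, and keep the encoding argument in sync with whatever client encoding you have selected.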

Again, all of this has historical reasons. In Python 2 ("legacy Python"), native strings were byte strings and it was considered normal to work with encoded strings. That's why PyGreSQL has always returned byte strings in Python 2 (it was the "expected thing" at that time), and we can't change that for backward compatibility reasons. In Python 3 the situation is different: strings are unicode by default, so PyGreSQL returns unicode there as well. This is one of the main reasons why switching to Python 3 is highly recommended.

Hope this clarifies your issue.

-- Christoph
