On 12/30/2012 7:39 PM, Marcel Rodrigues wrote:
I'm using Python 3.3 (CPython) and am having trouble getting the
standard gettext module to handle Unicode messages.

I have never even looked at the doc before, but I will take a look.

My problem can be isolated as follows:

I have 3 files in a folder: greeting.py, greeting.po and msgfmt.py.

-- greeting.py --
import gettext

t = gettext.translation("greeting", "locale", ["pt"])
_ = t.lgettext

gettext.lgettext(message)
Equivalent to gettext(), but the translation is returned in the preferred system encoding, if no other encoding was explicitly set with bind_textdomain_codeset().

Given that 'preferred system encoding' apparently means 'locale.getpreferredencoding', and that does not seem to be what you want, why are you using the 'l' version?


print("_charset = {0}\n".format(t._charset))
print(_("hello"))

A strong suggestion: whenever you want to print a string and the computation of the string (or bytes) involves encoding/decoding, separate the computation and the printing (on two separate lines).

s = _("hello")
print(s)

The reason is that printing also requires encoding for the output device and that process can also generate a UnicodeError that may be hard to distinguish from an error in the computation of s itself.

-- EOF --

-- greeting.po --
msgid ""
msgstr ""
"Project-Id-Version: 1.0\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

msgid "hello"
msgstr "olá"
-- EOF --

msgfmt.py was downloaded from
http://hg.python.org/cpython/file/9e6ead98762e/Tools/i18n/msgfmt.py,
since this tool apparently isn't included in the python3 package
available on Arch Linux official repositories.

It's probably also worth noting that the file greeting.po is encoded
itself as UTF-8.

From that folder, I run the following commands:

$ mkdir -p locale/pt/LC_MESSAGES
$ python msgfmt.py -o !$/greeting.mo greeting.po
$ python greeting.py

The output is:
_charset = UTF-8

Traceback (most recent call last):
   File "greeting.py", line 7, in <module>
     print(_("hello"))
   File "/usr/lib/python3.3/gettext.py", line 314, in lgettext
     return tmsg.encode(locale.getpreferredencoding())
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in
position 2: ordinal not in range(128)

In particular, previous posts here have shown this exact error raised during printing rather than during the string computation, and the posters wasted time looking for the error in the string or bytes computation itself.
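
A minimal sketch of how the two-step form tells the two failures apart. It assumes an in-memory stream with a chosen encoding as a stand-in for the output device; compute and locate_failure are hypothetical names for illustration, not part of gettext.

```python
import io


def compute(text, codec=None):
    # Stand-in for s = _("hello"). With a codec it mimics the encode step
    # that lgettext performs; without one it returns the str unchanged.
    return text if codec is None else text.encode(codec)


def locate_failure(text, codec, device_encoding):
    # In-memory text stream standing in for the terminal (an assumption).
    device = io.TextIOWrapper(io.BytesIO(), encoding=device_encoding)
    # Step 1: the computation on its own line.
    try:
        s = compute(text, codec)
    except UnicodeError:
        return "computation"
    # Step 2: the printing on its own line.
    try:
        print(s, file=device)
        device.flush()
    except UnicodeError:
        return "printing"
    return None


message = "ol\u00e1"
# Encoding to ASCII fails while computing, as in the traceback.
print(locate_failure(message, "ascii", "utf-8"))
# A plain str computes fine but cannot be written to an ASCII device.
print(locate_failure(message, None, "ascii"))
```

Both cases raise a UnicodeEncodeError with nearly the same message; only the line it is raised on says which step is at fault.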

My interpretation of this output is that even though gettext correctly
detects the MO file charset as UTF-8, it tries to encode the translated
message with the system's "preferred encoding", which happens to be ASCII.

Just as you seem to have requested ;-)

Anyone know why this happens? Is this a bug on my code? Maybe I have
misunderstood gettext...

You used lgettext (l = locale). As I said, I am new to this.
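
For comparison, a rough sketch of what I would expect the plain gettext method to return for the same catalog. It assumes a hypothetical build_mo helper that packs a dict into a minimal .mo image in memory, standing in for msgfmt.py, and it mimics the failing lgettext step with the encode call shown in the traceback rather than calling lgettext itself.

```python
import gettext
import io
import struct


def build_mo(catalog):
    # Hypothetical helper: pack {msgid: msgstr} into minimal little-endian
    # .mo bytes (header, two length/offset tables, NUL-terminated strings).
    items = sorted((k.encode("utf-8"), v.encode("utf-8"))
                   for k, v in catalog.items())
    n = len(items)
    ids_blob = b"".join(k + b"\0" for k, _ in items)
    strs_blob = b"".join(v + b"\0" for _, v in items)
    id_pos = 28 + 16 * n            # first msgid follows header and tables
    str_pos = id_pos + len(ids_blob)
    id_table, str_table = [], []
    for k, v in items:
        id_table += [len(k), id_pos]
        str_table += [len(v), str_pos]
        id_pos += len(k) + 1
        str_pos += len(v) + 1
    header = struct.pack("<7I", 0x950412DE, 0, n, 28, 28 + 8 * n, 0, 0)
    tables = struct.pack("<%dI" % (4 * n), *(id_table + str_table))
    return header + tables + ids_blob + strs_blob


# Same content as greeting.po: a charset header plus one message.
catalog = {
    "": "Content-Type: text/plain; charset=UTF-8\n",
    "hello": "ol\u00e1",
}
t = gettext.GNUTranslations(io.BytesIO(build_mo(catalog)))

s = t.gettext("hello")              # a str, no locale-dependent encoding
print(ascii(s))

# The step from the traceback, with the preferred encoding fixed to ASCII.
try:
    s.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)
```

With the real files, the corresponding change in greeting.py would be `_ = t.gettext` in place of `_ = t.lgettext`.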

--
Terry Jan Reedy

