On 12/30/2012 7:39 PM, Marcel Rodrigues wrote:
I'm using Python 3.3 (CPython) and am having trouble getting the
standard gettext module to handle Unicode messages.
I have never even looked at the doc before, but I will take a look.
My problem can be isolated as follows:
I have 3 files in a folder: greeting.py, greeting.po and msgfmt.py.
-- greeting.py --
import gettext
t = gettext.translation("greeting", "locale", ["pt"])
_ = t.lgettext
gettext.lgettext(message)
Equivalent to gettext(), but the translation is returned in the
preferred system encoding, if no other encoding was explicitly set with
bind_textdomain_codeset().
Given that 'preferred system encoding' apparently means
'locale.getpreferredencoding', and that seems not to be what you want,
why are you using the 'l' version?
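A quick way to see what that encoding is on a given machine, as a minimal sketch. The helper name preferred_codec_name is made up for illustration, and the result depends entirely on the locale settings of whoever runs it:

```python
import codecs
import locale


def preferred_codec_name():
    """Hypothetical helper: the preferred encoding, as Python's codec name."""
    # False means: report the current setting without calling setlocale().
    reported = locale.getpreferredencoding(False)
    # Different platforms report different spellings of the same encoding,
    # so map the reported name onto the name of the codec Python would use.
    return codecs.lookup(reported).name


name = preferred_codec_name()
print(name)
```

If this prints 'ascii', any string containing a non-ASCII character will fail to encode with it.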
print("_charset = {0}\n".format(t._charset))
print(_("hello"))
A strong suggestion: whenever you want to print a string and the
computation of the string (or bytes) involves encoding/decoding,
separate the computation and the printing (on two separate lines).
s = _("hello")
print(s)
The reason is that printing also requires encoding for the output device
and that process can also generate a UnicodeError that may be hard to
distinguish from an error in the computation of s itself.
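To illustrate that printing alone can fail, here is a small sketch. It assumes an output stream limited to ASCII, simulated with io.TextIOWrapper; the name ascii_out is invented for the example and the string s is simply given, so no computation is involved:

```python
import io

# The string is already computed; nothing below touches gettext.
s = "ol\u00e1"

# Stand-in for an output device whose encoding is ASCII (an assumption
# made for illustration; a real terminal may differ).
ascii_out = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

try:
    print(s, file=ascii_out)
    print_error = None
except UnicodeEncodeError as exc:
    # Raised by the output step, not by the computation of s.
    print_error = exc

print(repr(print_error))
```

With the two steps on separate lines, the traceback points at the print line and you know s itself was computed without error.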
-- EOF --
-- greeting.po --
msgid ""
msgstr ""
"Project-Id-Version: 1.0\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
msgid "hello"
msgstr "olá"
-- EOF --
msgfmt.py was downloaded from
http://hg.python.org/cpython/file/9e6ead98762e/Tools/i18n/msgfmt.py,
since this tool apparently isn't included in the python3 package
available on Arch Linux official repositories.
It's probably also worth noting that the file greeting.po is encoded
itself as UTF-8.
From that folder, I run the following commands:
$ mkdir -p locale/pt/LC_MESSAGES
$ python msgfmt.py -o !$/greeting.mo greeting.po
$ python greeting.py
The output is:
_charset = UTF-8
Traceback (most recent call last):
  File "greeting.py", line 7, in <module>
    print(_("hello"))
  File "/usr/lib/python3.3/gettext.py", line 314, in lgettext
    return tmsg.encode(locale.getpreferredencoding())
UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 2: ordinal not in range(128)
In particular, we have seen, in previous posts here, this exact error
generated during printing rather than during the string computation and
posters have wasted time looking for the error in the string or bytes
computation itself.
My interpretation of this output is that even though gettext correctly
detects the MO file charset as UTF-8, it tries to encode the translated
message with the system's "preferred encoding", which happens to be ASCII.
Just as you seem to have requested ;-)
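The failing step can be reproduced without gettext at all. A minimal sketch, assuming the preferred encoding is ASCII as in your traceback (the variable names are mine):

```python
# The translated message as a str, i.e. what the catalog holds after
# the MO file has been decoded as UTF-8.
tmsg = "ol\u00e1"

try:
    # What the lgettext line in the traceback amounts to when
    # locale.getpreferredencoding() reports ASCII.
    tmsg.encode("ascii")
    failure = None
except UnicodeEncodeError as exc:
    failure = exc

# The same string encodes without trouble as UTF-8.
utf8_bytes = tmsg.encode("utf-8")

print(failure)
print(utf8_bytes)
```

Note also that the 'l' version hands back bytes, not str, so it is not what you want to pass to print in Python 3 in any case.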
Anyone know why this happens? Is this a bug in my code? Maybe I have
misunderstood gettext...
You used lgettext (l = locale). As I said, I am new to this.
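If what you want is the translated str, the plain gettext method looks like the one to use. Below is a self-contained sketch of that, not a tested fix for your setup. To avoid depending on your locale folder, it builds the MO data in memory with a made-up helper, build_mo, which is a simplified imitation of what msgfmt.py writes; in your greeting.py the only change would be using t.gettext instead of t.lgettext.

```python
import gettext
import io
import struct


def build_mo(catalog):
    """Hypothetical helper: pack {msgid: msgstr} (bytes) into GNU MO data."""
    keys = sorted(catalog)  # the format expects msgids in sorted order
    ids = b""
    strs = b""
    spans = []
    for key in keys:
        spans.append((len(key), len(ids), len(catalog[key]), len(strs)))
        ids += key + b"\0"
        strs += catalog[key] + b"\0"
    header_size = 7 * 4
    key_start = header_size + 16 * len(keys)  # strings follow both index tables
    value_start = key_start + len(ids)
    out = struct.pack("<Iiiiiii", 0x950412DE, 0, len(keys),
                      header_size, header_size + 8 * len(keys), 0, 0)
    for klen, koff, _vlen, _voff in spans:
        out += struct.pack("<ii", klen, koff + key_start)
    for _klen, _koff, vlen, voff in spans:
        out += struct.pack("<ii", vlen, voff + value_start)
    return out + ids + strs


# Same content as greeting.po, reduced to the header line that matters here.
catalog = {
    b"": b"Content-Type: text/plain; charset=UTF-8\n",
    b"hello": "ol\u00e1".encode("utf-8"),
}
t = gettext.GNUTranslations(io.BytesIO(build_mo(catalog)))

_ = t.gettext    # str in, str out; no re-encoding to the locale's encoding
s = _("hello")   # compute first ...
print(ascii(s))  # ... then print (ascii() keeps this safe on any terminal)
```

Whether print(s) itself then succeeds on your terminal is a separate question, since it depends on the encoding of stdout.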
--
Terry Jan Reedy