Re: unicode issue

Dave Angel Thu, 01 Oct 2009 03:52:09 -0700

gentlestone wrote:

save in utf-8 the coding declaration also has to be utf-8


ok, I understand, but what's the problem? Unfortunately it seems that
the Python interactive
mode doesn't have unicode support. It recognizes the latin-1 encoding
only.

So I have 2 options for how to write the doctest:
1. Replace native characters with their encoded representation like
u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" instead of u"Žabovitá
zmiešaná kaša"
2. Declare the latin-1 encoding, even though the file is saved in utf-8

The first is bad because doctest is a great documentation tool and it
is probably the main reason I use python. And something like
u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" is not the best
documentation style. But the tests work.

The second is bad, because the declaration is incorrect and if I use
it in a Django model declaration, for example, I get bad data in the
application.

So what is the solution? Back to Java? :-)

Wait -- don't give up yet. Since I'm one of the ones who (partially) steered you wrong, let me try to help.

Key variable here is how your text editor behaves. Since I've never taken my (programming) text editor out of ASCII mode before this week, it took some experimenting (and more importantly a message from Piet on this thread) to make sense of things. I think I now know how to make my own editor (Komodo IDE) behave in this environment, and you probably can do as well or better. In fact, judging from your messages, you probably are doing much better on the editor front.

When I tried this morning to re-open that test file from yesterday, many of the characters were all messed up. I was okay as long as the project was still open, but not today. The editor itself apparently looks to that encoding declaration when it's deciding how to interpret the bytes on disk.

So I did the following, using Komodo IDE. I created a new file in the project. Before saving it, I used Edit->CurrentFileSettings->Properties->Encoding to set it to UTF-8. *NOW* I pasted the stuff from your email message. And added the

#-*- coding: utf-8 -*-

as the second line of the file.   Notice it's *NOT* latin-1.

At this point I save and run the file, and it seems to work fine.

My guess is that I could set these as default settings in Komodo, if I were doing UTF-8 very often, and it would become painless. I know I have certain stuff in my python template, and could add that encoding line as well.

Anyway, that gets us to the step of running the doctest. The trick here seems to be that we need to define the docstring as a Unicode docstring to have it interpreted correctly. Try adding the u in front of the triple quote as follows:


def downcode(name):
   u"""
   >>> downcode(u"Žabovitá zmiešaná kaša")
   u'Zabovita zmiesana kasa'
   """
   for key, value in _MAP.iteritems():
       name = name.replace(key, value)
   return name

Now, if the doctest passes, we seem to be in good shape.
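
For anyone who wants to try this without the original module, here is a minimal self-contained sketch. The contents of _MAP are an assumption on my part (the real table is not shown in this message), and it only covers the three accented characters in the sample string. The doctest compares against the expected value instead of showing a repr, so the result does not depend on whether the interpreter prints a u prefix.

```python
# -*- coding: utf-8 -*-

# Hypothetical mapping: the real _MAP is not shown in this thread.
# It covers only the accented characters used in the sample string.
_MAP = {
    u"Ž": u"Z",
    u"á": u"a",
    u"š": u"s",
}


def downcode(name):
    u"""Replace each accented character in name with its ASCII stand-in.

    >>> downcode(u"Žabovitá zmiešaná kaša") == u"Zabovita zmiesana kasa"
    True
    """
    for key, value in _MAP.items():
        name = name.replace(key, value)
    return name


if __name__ == "__main__":
    import doctest
    doctest.testmod()
```

The file still has to be saved as UTF-8 for the coding line to be truthful.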

There's another problem, that hopefully somebody else can help with. That's if doctest needs to report an error. When I deliberately changed the "expect" string I get an error like the following.

UnicodeEncodeError: 'ascii' codec can't encode character u'\u017d' in position 150: ordinal not in range(128)

I get a similar error if running the -v option on doctest. (Note that I do *NOT* get the error when running inside Komodo. And what I've read implies that the same would be true if running inside IDLE.) The problem is similar to the one you'd have doing a simple:


   print u"\u017d"

I think these are avoided if sys.stdout.encoding (and maybe sys.stderr.encoding) are set to utf-8. On my system they're set to None, which says to use "the system default encoding." On my system that would be ASCII, so I get the error. But perhaps yours is already something better.
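
A quick way to see what your own console will accept is to try the encode step by hand. The sketch below is only illustrative: can_encode is a hypothetical helper, not something from the standard library, and it treats an encoding of None as ASCII, which is an assumption that mirrors what I see on my system.

```python
# -*- coding: utf-8 -*-
import sys


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec name.

    Hypothetical helper. An encoding of None is treated as ASCII, which
    is an assumption matching the fallback described above.
    """
    try:
        text.encode(encoding or "ascii")
    except (UnicodeEncodeError, LookupError):
        return False
    return True


if __name__ == "__main__":
    # May be None when output is piped or redirected.
    encoding = getattr(sys.stdout, "encoding", None)
    print("stdout encoding: %r" % (encoding,))
    print("can show Z-caron: %r" % (can_encode(u"\u017d", encoding),))
```

Note that latin-1 fails for u"\u017d" as well, since that character lies outside the Latin-1 range, so a latin-1 console would not help with this particular string.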

I found links:

http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/
http://wiki.python.org/moin/PrintFails
http://lists.macromates.com/textmate/2008-June/025735.html

which indicate you may want to try:

set LC_CTYPE=en_GB.utf-8 python

at the command prompt before running python.  This could be system specific;  
it didn't work for me on XP.

The workaround that works for me (so far) is:

if __name__ == "__main__":
   import sys, codecs
   sys.stdout = codecs.getwriter('utf8')(sys.stdout)

   print u"Žabovitá zmiešaná kaša"
   import doctest
   doctest.testmod()

The codecs line tells python that stdout should use utf-8.  That doesn't make 
the characters look good on my console, but at least it avoids the errors.  I'm 
guessing that on my system I should use latin1 here instead of utf8.  But I 
don't want to confuse things.
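
To see what the codecs wrapper actually does without depending on a particular console, you can point it at an in-memory byte buffer. This is only a sketch: io.BytesIO stands in for stdout (my substitution, for illustration), and the names raw, writer and encoded are made up for the example. The writer turns each unicode string into UTF-8 bytes before they reach the underlying stream.

```python
# -*- coding: utf-8 -*-
import codecs
import io

# Stand-in for the console: a byte stream we can inspect afterwards.
raw = io.BytesIO()

# Same wrapping as the workaround above, applied to the buffer.
writer = codecs.getwriter("utf8")(raw)
writer.write(u"Žabovitá zmiešaná kaša\n")

# The underlying stream received UTF-8 encoded bytes, not unicode text.
encoded = raw.getvalue()
print(repr(encoded))
```

That is why the error goes away even when the console cannot display the result: the encoding step no longer goes through the ASCII codec.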


HTH

DaveA

--
http://mail.python.org/mailman/listinfo/python-list
