gentlestone wrote:
save in utf-8 the coding declaration also has to be utf-8

OK, I understand, but what's the problem? Unfortunately, it seems that
the Python interactive
mode doesn't have Unicode support. It recognizes only the latin-1
encoding.

So I have 2 options for how to write the doctest:
1. Replace native characters with their encoded representation, like
u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" instead of u"Žabovitá
zmiešaná kaša"
2. Declare latin-1 encoding even though the file is saved in utf-8

The first is bad because doctest is a great documentation tool and it
is probably the main reason I use Python. And something like
u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a" is not the best
documentation style. But the tests work.

The second is bad because the declaration is incorrect, and if I use
it in a Django model declaration, for example, I get bad data in the
application.

So what is the solution? Back to Java? :-)

Wait -- don't give up yet. Since I'm one of the ones who (partially) steered you wrong, let me try to help.

The key variable here is how your text editor behaves. Since I'd never taken my (programming) text editor out of ASCII mode before this week, it took some experimenting (and, more importantly, a message from Piet on this thread) to make sense of things. I think I now know how to make my own editor (Komodo IDE) behave in this environment, and you can probably do as well or better. In fact, judging from your messages, you're probably doing much better on the editor front.

When I tried this morning to re-open that test file from yesterday, many of the characters were all messed up. I was okay as long as the project was still open, but not today. The editor itself apparently looks to that encoding declaration when it's deciding how to interpret the bytes on disk.

So I did the following, using Komodo IDE. I created a new file in the project. Before saving it, I used Edit->CurrentFileSettings->Properties->Encoding to set it to UTF-8. *NOW* I pasted the stuff from your email message, and added
#-*- coding: utf-8 -*-

as the second line of the file. Notice it's *NOT* latin-1.
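
To see why the declared encoding has to match how the file was actually saved, here is a minimal sketch (an illustration only, not code from this thread) that decodes the same UTF-8 bytes under both assumptions. The sample string is just the first word of the test text, written with escapes so the snippet itself is pure ASCII.

```python
# Illustration only: the same bytes on disk, read under two different
# encoding assumptions.

text = u"\u017dabovit\xe1"             # u"Žabovitá", written with escapes
on_disk = text.encode("utf-8")         # the bytes a UTF-8 editor would save

as_utf8 = on_disk.decode("utf-8")      # declaration matches the saved file
as_latin1 = on_disk.decode("latin-1")  # declaration claims latin-1 instead

print(repr(as_utf8))
print(repr(as_latin1))
print(as_utf8 == text)
print(as_latin1 == text)
```

Under latin-1 every non-ASCII character comes back as two unrelated characters, which is presumably the kind of damage that shows up when the declaration and the real file encoding disagree.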

At this point I save and run the file, and it seems to work fine.

My guess is that I could set these as default settings in Komodo, if I were doing UTF-8 very often, and it would become painless. I know I have certain stuff in my python template, and could add that encoding line as well.


Anyway, that gets us to the step of running the doctest. The trick here seems to be that we need to define the docstring as a Unicode docstring to have it interpreted correctly. Try adding the u in front of the triple quote as follows:

def downcode(name):
    u"""
    >>> downcode(u"Žabovitá zmiešaná kaša")
    u'Zabovita zmiesana kasa'
    """
    # _MAP is the character-replacement table from your module (not shown here).
    for key, value in _MAP.iteritems():
        name = name.replace(key, value)
    return name

Now, if the doctest passes, we seem to be in good shape.
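
For anyone who wants to try the idea without the rest of the module, here is a self-contained sketch under a few assumptions. The three-entry _MAP is a made-up stand-in for the real table, the doctest uses print so the expected output does not depend on whether repr shows a u prefix, and the test is run through doctest.DocTestParser and doctest.DocTestRunner directly rather than doctest.testmod(), so it does not matter how the file is launched.

```python
# -*- coding: utf-8 -*-
import doctest

# Hypothetical stand-in for the real replacement table (assumption).
_MAP = {u"Ž": u"Z", u"á": u"a", u"š": u"s"}


def downcode(name):
    u"""
    >>> print(downcode(u"Žabovitá zmiešaná kaša"))
    Zabovita zmiesana kasa
    """
    for key, value in _MAP.items():
        name = name.replace(key, value)
    return name


# Parse the docstring into a doctest and run it explicitly.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    downcode.__doc__, {"downcode": downcode}, "downcode", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results)
```

The file has to be saved as UTF-8 for the coding line to be truthful. A result with failed=0 means the example in the docstring matched.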

There's another problem that hopefully somebody else can help with: what happens when doctest needs to report an error. When I deliberately changed the "expect" string, I got an error like the following.

UnicodeEncodeError: 'ascii' codec can't encode character u'\u017d' in position 150: ordinal not in range(128)

I get a similar error when running doctest with the -v option. (Note that I do *NOT* get the error when running inside Komodo, and what I've read implies that the same would be true inside IDLE.) The problem is similar to the one you'd have doing a simple:

   print u"\u017d"

I think these are avoided if sys.stdout.encoding (and maybe sys.stderr.encoding) are set to utf-8. On my system they're set to None, which says to use "the system default encoding." On my system that would be ASCII, so I get the error. But perhaps yours is already something better.
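
A quick way to check this on a given machine is sketched below. The helper name can_encode is made up for the illustration, and treating an encoding of None as ASCII is an assumption that mirrors the system default described above.

```python
import sys


def can_encode(text, encoding):
    """Return True if text can be encoded with the given encoding.

    An encoding of None is treated as ASCII (assumption: that is the
    system default on the machine in question).
    """
    try:
        text.encode(encoding or "ascii")
    except UnicodeEncodeError:
        return False
    return True


# Some replacement stdout objects have no encoding attribute at all.
stdout_encoding = getattr(sys.stdout, "encoding", None)
print("sys.stdout.encoding = %r" % (stdout_encoding,))
print("can print U+017D: %r" % (can_encode(u"\u017d", stdout_encoding),))
```

If the second line reports False, a plain print of that character can be expected to fail in the same way as the doctest error report.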

I found these links:
http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/
http://wiki.python.org/moin/PrintFails
http://lists.macromates.com/textmate/2008-June/025735.html

which indicate you may want to try:
set LC_CTYPE=en_GB.utf-8 python

at the command prompt before running python.  This could be system specific;
it didn't work for me on XP.
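
To see whether the setting had any effect, a small report like the following can be run before and after changing LC_CTYPE. This is only a sketch: the function name encoding_report is made up here, and how much the environment variable influences these values is platform dependent.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the values that are relevant to console encoding."""
    return {
        "LC_CTYPE": os.environ.get("LC_CTYPE"),
        "preferred_encoding": locale.getpreferredencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


report = encoding_report()
for name in sorted(report):
    print("%s = %r" % (name, report[name]))
```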

The workaround that works for me (so far) is:

if __name__ == "__main__":
   import sys, codecs
   sys.stdout = codecs.getwriter('utf8')(sys.stdout)

   print u"Žabovitá zmiešaná kaša"
   import doctest
   doctest.testmod()

The codecs line tells python that stdout should use utf-8.  That doesn't make 
the characters look good on my console, but at least it avoids the errors.  I'm 
guessing that on my system I should use latin1 here instead of utf8.  But I 
don't want to confuse things.
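
What the codecs writer does can be seen without touching the real stdout. In the sketch below an io.BytesIO buffer stands in for a byte-oriented stdout (an assumption made for the demo). It also suggests that latin1 would not be enough for this particular text, since latin-1 has no Ž or š, unless an error handler such as errors="replace" is supplied.

```python
import codecs
import io

text = u"\u017dabovit\xe1 zmie\u0161an\xe1 ka\u0161a\n"

# A BytesIO buffer stands in for a byte-oriented stdout.
utf8_out = io.BytesIO()
codecs.getwriter("utf8")(utf8_out).write(text)
utf8_bytes = utf8_out.getvalue()

# latin-1 cannot represent U+017D or U+0161, so a strict writer raises.
latin1_out = io.BytesIO()
try:
    codecs.getwriter("latin1")(latin1_out).write(text)
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# With errors="replace" the unencodable characters become "?".
replaced_out = io.BytesIO()
codecs.getwriter("latin1")(replaced_out, errors="replace").write(text)
replaced_bytes = replaced_out.getvalue()

print(repr(utf8_bytes))
print(latin1_failed)
print(repr(replaced_bytes))
```

The replace handler trades correctness for robustness: nothing raises, but the affected characters are lost in the output.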


HTH

DaveA

--
http://mail.python.org/mailman/listinfo/python-list
