On Sun, Aug 07, 2011 at 07:01:24PM -0600, Aaron Meurer wrote:

> Well, making the original text unicode in the first place is not
> possible, as it is just grabbed from the variable values.

Oh, actually, since you see text on the screen, then the conversion happens at
some point. Even if bytes are sent in raw form to the terminal, there is still
the underlying assumption of an encoding to interpret the bytes as text. That's
possibly the one important thing about dealing with Unicode: there is no such
thing as 'raw text', and whenever you see text anywhere there's Unicode and
encodings involved. The urwid documentation actually seems to say we're
supposed to give it Unicode by default, and only use bytestrings if we know
what we're doing.
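To make that concrete, here's a quick illustration (plain Python, the byte values are just an example): the exact same bytes become different text depending on which encoding you assume.

```python
# The same two bytes decode to different text under different encodings:
data = b'\xc3\xa9'             # the UTF-8 encoding of 'é'
print(data.decode('utf-8'))    # é   (one character)
print(data.decode('latin-1'))  # Ã©  (two characters)
```

There is no way to tell from the bytes alone which reading is "right"; that knowledge has to come from somewhere else, which is the whole problem.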

> But even so, I am making the text unicode first.  I'm doing text[i] =
> (unicode(text[i][:maxcol-1]) + unicode(u'…') +
> unicode(text[i][maxcol:])) (see the source).  But this has no effect.

Ah, I think I can see a few problems here.

1/ The proper parenthesis order should be:
  unicode(text[i])[:maxcol-1]...
That is, convert to Unicode first, then slice.

2/ Casting to Unicode without specifying the encoding is bad mojo, because if
left unspecified Python 2 will generally use ASCII, which is almost never correct.

Unfortunately, we have no easy way to know the encoding of the variables'
contents. We could parse the 'coding' declaration of the source files, but, uh.
Annoying. In such cases, the proper approach is generally to go for a sane
default and maybe make it configurable. A sane default here should probably be
whatever is declared in the terminal's locale, since we can assume the user is
opening and saving their source code with the same encoding.
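For what it's worth, Python exposes that locale-derived default directly (just a sketch; the actual value depends entirely on the user's environment):

```python
import locale

# The encoding declared by the user's locale: a reasonable guess for
# whatever encoding their source files are saved in, not a guarantee.
encoding = locale.getpreferredencoding()
print(encoding)  # e.g. 'UTF-8' on most modern systems
```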

3/ Casting to Unicode when unsure about the proper encoding pretty much
requires a safety net, which Python conveniently provides:

  >>> print unicode('\xe9', 'latin1')
  é
  >>> print unicode('\xe9', 'ascii')
  Traceback (most recent call last):
    ...
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)
  >>> print unicode('\xe9', 'ascii', 'replace')
  �
  
The 'replace' bit tells Python to replace the bytes it can't decode with a
special Unicode character (U+FFFD, usually rendered as a question mark inside
a black diamond). You can also 'ignore' such bytes to leave them out of the
decoded string entirely.
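Side by side (again, the byte values are just an example):

```python
data = b'caf\xe9'                       # 'café' in latin-1; invalid as ASCII
print(data.decode('ascii', 'replace'))  # caf\ufffd -- bad byte replaced
print(data.decode('ascii', 'ignore'))   # caf       -- bad byte dropped
```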

4/ Also, u'…' is already Unicode, no need to cast it.


So a possible implementation of your code could be:

import locale
# Note: getdefaultlocale() may return (None, None) if no locale is set,
# hence the fallback; decode once and slice the result, rather than
# decoding twice.
loc, encoding = locale.getdefaultlocale()
decoded = text[i].decode(encoding or 'utf-8', 'replace')
text[i] = decoded[:maxcol-1] + u'…' + decoded[maxcol:]

> I think the problem might have something to do with the color codes.

That too. ANSI dates back to waaaay before talking with 'em foreeners with
their strange alphabets ever was an issue, and as such, it makes a number of
assumptions that are no longer true. In Spyrit (a project where I've got to
parse bytes containing both ANSI and text of uncertain encoding) I deal with it
by tokenizing the ANSI out as early as possible. Conversely, here, the proper
approach would probably be to apply the ANSI formatting as late as possible,
and generally after having turned your text back into bytes for the terminal's
benefit.
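To give an idea of what the Spyrit-style "tokenize ANSI out early" approach looks like, here's a minimal sketch (the regex only covers SGR color/style sequences, and the function name and UTF-8 default are made up for the example):

```python
import re

# SGR (color/style) sequences only; real ANSI parsing handles more.
ANSI_SGR = re.compile(rb'\x1b\[[0-9;]*m')

def decode_around_ansi(data, encoding='utf-8'):
    """Decode the text between ANSI codes; pass the codes through untouched."""
    parts = []
    pos = 0
    for m in ANSI_SGR.finditer(data):
        parts.append(data[pos:m.start()].decode(encoding, 'replace'))
        parts.append(m.group().decode('ascii'))  # ANSI escapes are pure ASCII
        pos = m.end()
    parts.append(data[pos:].decode(encoding, 'replace'))
    return u''.join(parts)
```

That way the escape bytes never pass through the codec, so a 'replace' handler can't mangle them.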

And this is where we veer into Not Worth It territory, so if you're not
comfortable dealing with the whole Unicode-and-ANSI bytefest, I'd advise
outright discarding the feature. Seriously, PuDB has been doing just great
so far without the ellipsis character.

> Anyway, I think unicode characters in urwid are just broken (at least
> in Python 2; once the pudb Python 3 port is ready I'll try it there).

According to the documentation: http://excess.org/urwid/wiki/TextEncodings

... urwid actually seems to do the right thing. :) (Although it does refer to
bytestrings as 'normal strings', which is usually a dangerous mindset.)

Apparently, Pygments also does the right thing: its text tokens are always
Unicode, and its formatters take an 'encoding' argument for proper conversion
into bytes.

So this might actually be a bug in PuDB. I'll take a dive in the source if I
have time, and see what gives.

> So I think getting the unicode … to work at the end of the variables
> list is a loosing battle, at least in Python 2.

Agreed. It would have been cool, but just not cool enough to offset the pain of
implementing it.

Unicode is the number one reason why I can't wait for Python 3 to become
standard. Soon, soon. :)

Bye guys,

-- S.

_______________________________________________
Pudb mailing list
[email protected]
http://lists.tiker.net/listinfo/pudb
