On Mon, Aug 8, 2011 at 2:43 AM, Sundance <[email protected]> wrote:
> On Sun, Aug 07, 2011 at 07:01:24PM -0600, Aaron Meurer wrote:
>
>> Well, making the original text unicode in the first place is not
>> possible, as it is just grabbed from the variable values.
>
> Oh, actually, since you see text on the screen, then the conversion happens at
> some point. Even if bytes are sent in raw form to the terminal, there is still
> the underlying assumption of an encoding to interpret the bytes as text. 
> That's
> possibly the one important thing about dealing with Unicode: there is no such
> thing as 'raw text', and whenever you see text anywhere there's Unicode and
> encodings involved. The urwid documentation actually seems to say we're
> supposed to give it Unicode by default, and only use bytestrings if we know
> what we're doing.
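A tiny Python 3 illustration of that point: the same bytes decode to
different text depending on the encoding you assume, so there is never
"raw text", only bytes plus an assumed encoding.

```python
# Identical bytes, two different "texts" depending on the assumed encoding.
raw = b"\xc3\xa9"                 # UTF-8 encoding of 'é'
print(raw.decode("utf-8"))        # 'é'
print(raw.decode("latin-1"))      # 'Ã©' (same bytes, wrong assumption)
```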
>
>> But even so, I am making the text unicode first.  I'm doing text[i] =
>> (unicode(text[i][:maxcol-1]) + unicode(u'…') +
>> unicode(text[i][maxcol:])) (see the source).  But this has no effect.
>
> Ah, I think I can see a few problems here.
>
> 1/ The proper parenthesis order should be:
>  unicode(text[i])[:maxcol-1]...
>
> 2/ Casting to Unicode without specifying the encoding is bad mojo, because if
> unspecified, Python 2 will generally use ASCII, which is almost never correct.
>
> Unfortunately, we have no easy way to know the encoding of the variables'
> contents. We could parse the 'coding' declaration of the source files, but, 
> uh.
> Annoying. In such cases, the proper approach is generally to go for a sane
> default and maybe make it configurable. A sane default here should probably be
> whatever is declared in the terminal's locale, since we can assume the user is
> opening and saving their source code with the same encoding.
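In Python 3 terms, that default would come from the locale module; here
is a sketch (the fallback to UTF-8 when the locale reports nothing is my
own assumption, not anything PuDB does):

```python
import locale

# Use the terminal's locale to pick a decoding default; fall back to
# UTF-8 if the locale reports nothing (an assumption on my part).
encoding = locale.getpreferredencoding(False) or "utf-8"
text = b"caf\xc3\xa9".decode(encoding, "replace")  # never raises
```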
>
> 3/ Casting to Unicode when unsure about the proper encoding pretty much
> requires a safety net, which Python conveniently provides:
>
>  >>> print unicode('\xe9', 'latin1')
>  é
>  >>> print unicode('\xe9', 'ascii')
>  Traceback...
>  UnicodeDecodeError...
>  >>> print unicode('\xe9', 'ascii', 'replace')
>  �
>
> The 'replace' bit tells Python to replace the bytes it can't decode with a
> special Unicode character, usually a question mark inside a circle. You can
> also 'ignore' such bytes to leave them out of the decoded string entirely.
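For reference, the same safety net in Python 3 spelling, where
bytes.decode takes the error handler instead of the unicode() builtin:

```python
# Python 3 spelling of the examples above: bytes.decode with an
# error handler in place of the unicode() builtin.
raw = b"\xe9"                          # 'é' in Latin-1
print(raw.decode("latin-1"))           # é
try:
    raw.decode("ascii")                # strict mode: raises
except UnicodeDecodeError:
    print("undecodable as ASCII")
print(raw.decode("ascii", "replace"))  # U+FFFD replacement character
print(raw.decode("ascii", "ignore"))   # empty string: byte dropped
```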
>
> 4/ Also, u'…' is already Unicode, no need to cast it.

I know.  I was trying really hard (but to no avail). :)

>
>
> So a possible implementation of your code could be:
>
> import locale
> loc, encoding = locale.getdefaultlocale()
> text[i] = text[i].decode(encoding, 'replace')[:maxcol-1] \
>        + u'…' \
>        + text[i].decode(encoding, 'replace')[maxcol:]
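One nit with that snippet: it decodes the same bytes twice. Decoding
once reads a little better; here is a sketch in Python 3 terms (the
function name is made up, and the UTF-8 fallback is my assumption, not
PuDB's actual API):

```python
import locale

def ellipsize(raw, maxcol, encoding=None):
    """Decode raw bytes once, then splice in U+2026 at the cutoff.

    Mirrors the snippet above: the character at maxcol-1 is replaced
    by the ellipsis and the tail is kept (the widget clips it).
    """
    encoding = encoding or locale.getpreferredencoding(False) or "utf-8"
    text = raw.decode(encoding, "replace")
    if len(text) <= maxcol:
        return text
    return text[:maxcol - 1] + "\u2026" + text[maxcol:]

print(ellipsize(b"hello world", 8))  # hello w…rld
```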
>
>> I think the problem might have something to do with the color codes.

I did try doing this (albeit with the second argument to unicode();
I'm assuming that is basically the same).  urwid's detected_encoding
uses the locale module as I mentioned above.  This worked in the
terminal in IPython, but gave a UnicodeError when I put it in the PuDB
code.

>
> That too. ANSI dates back to waaaay before talking with 'em foreeners with
> their strange alphabets ever was an issue, and as such, makes a number of
> assumptions that are no longer true. In Spyrit (a project where I've got to
> parse bytes containing both ANSI and text of uncertain encoding) I deal by
> tokenizing ANSI out as early as possible. Conversely, here, the proper 
> approach
> would probably be to apply the ANSI formatting as late as possible, and
> generally after having turned your text back into bytes for the terminal's
> benefit.
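For illustration, tokenizing the ANSI out can be as simple as a regex
pass over the CSI sequences. This pattern covers the common color codes
only, not the full ANSI grammar, so treat it as a sketch:

```python
import re

# Matches CSI escape sequences like '\x1b[31m' (color) or '\x1b[0m'
# (reset): ESC, '[', numeric parameters, one final letter.
CSI_RE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")

def strip_ansi(text):
    """Remove CSI sequences so lengths can be measured on plain text."""
    return CSI_RE.sub("", text)

print(strip_ansi("\x1b[31mred\x1b[0m and plain"))  # red and plain
```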
>
> And this is where we veer into Not Worth It territory, so if you're not
> comfortable dealing with the whole Unicode-and-ANSI bytefest, I'd advise
> outright discarding the feature. Seriously, PuDB has been doing just great
> so far without the ellipsis character.
>
>> Anyway, I think unicode characters in urwid are just broken (at least
>> in Python 2; once the pudb Python 3 port is ready I'll try it there).
>
> According to the documentation: http://excess.org/urwid/wiki/TextEncodings
>
> ... urwid actually seems to do the right thing. :) (Although it does refer to
> bytestrings as 'normal strings', which is usually a dangerous mindset.)
>
> Apparently, Pygments also does the right thing: its text tokens are always
> Unicode, and its formatters take an 'encoding' argument for proper conversion
> into bytes.
>
> So this might actually be a bug in PuDB. I'll take a dive in the source if I
> have time, and see what gives.

That would be great.  I've given up trying to make this work.

Aaron Meurer

>
>> So I think getting the unicode … to work at the end of the variables
>> list is a losing battle, at least in Python 2.
>
> Agreed. It would have been cool, but just not cool enough to offset the pain 
> of
> implementing it.
>
> Unicode is the number one reason why I can't wait for Python 3 to become
> standard. Soon, soon. :)
>
> Bye guys,
>
> -- S.
>

_______________________________________________
Pudb mailing list
[email protected]
http://lists.tiker.net/listinfo/pudb