On Mon, Aug 8, 2011 at 2:43 AM, Sundance <[email protected]> wrote:
> On Sun, Aug 07, 2011 at 07:01:24PM -0600, Aaron Meurer wrote:
>
>> Well, making the original text unicode in the first place is not
>> possible, as it is just grabbed from the variable values.
>
> Oh, actually, since you see text on the screen, the conversion happens at
> some point. Even if bytes are sent in raw form to the terminal, there is
> still the underlying assumption of an encoding to interpret the bytes as
> text. That's possibly the one important thing about dealing with Unicode:
> there is no such thing as 'raw text', and whenever you see text anywhere,
> Unicode and encodings are involved. The urwid documentation actually seems
> to say we're supposed to give it Unicode by default, and only use
> bytestrings if we know what we're doing.
>
>> But even so, I am making the text unicode first. I'm doing text[i] =
>> (unicode(text[i][:maxcol-1]) + unicode(u'…') +
>> unicode(text[i][maxcol:])) (see the source). But this has no effect.
>
> Ah, I think I can see a few problems here.
>
> 1/ The proper parenthesis order should be:
> unicode(text[i])[:maxcol-1]...
>
> 2/ Casting to Unicode without specifying the encoding is bad mojo, because
> if unspecified, Python 2 will generally use ASCII, which is almost never
> correct.
>
> Unfortunately, we have no easy way to know the encoding of the variables'
> contents. We could parse the 'coding' declaration of the source files,
> but, uh. Annoying. In such cases, the proper approach is generally to go
> for a sane default and maybe make it configurable. A sane default here
> should probably be whatever is declared in the terminal's locale, since we
> can assume the user is opening and saving their source code with the same
> encoding.
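[Point 1/ matters for more than syntax. In Python 3 terms, where the decode step is explicit, slicing the bytes before decoding can cut a multi-byte character in half, while decoding first slices whole characters. A minimal sketch of the difference, under the assumption of UTF-8 input:]

```python
# Why slice order matters: 'ï' is two bytes in UTF-8, so a byte-level
# slice can land in the middle of it.
raw = 'naïve'.encode('utf-8')      # b'na\xc3\xafve'

decoded_then_sliced = raw.decode('utf-8')[:3]             # slices characters
sliced_then_decoded = raw[:3].decode('utf-8', 'replace')  # slices bytes

print(decoded_then_sliced)   # naï
print(sliced_then_decoded)   # na<U+FFFD> — the split 'ï' became a replacement char
```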
>
> 3/ Casting to Unicode when unsure about the proper encoding pretty much
> requires a safety net, which Python conveniently provides:
>
> >>> print unicode('\xe9', 'latin1')
> é
> >>> print unicode('\xe9', 'ascii')
> Traceback...
> UnicodeDecodeError...
> >>> print unicode('\xe9', 'ascii', 'replace')
> �
>
> The 'replace' bit tells Python to replace the bytes it can't decode with a
> special Unicode character, usually a question mark inside a circle. You
> can also 'ignore' such bytes to leave them out of the decoded string
> entirely.
>
> 4/ Also, u'…' is already Unicode, no need to cast it.
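[The same safety net exists in Python 3, where bytes.decode() takes the (encoding, errors) pair directly; a minimal sketch of the behaviour shown above:]

```python
raw = b'\xe9'  # 'é' in Latin-1; not valid ASCII

# A known encoding decodes cleanly.
print(raw.decode('latin1'))            # é

# The wrong encoding raises, unless an error handler is given.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    print('UnicodeDecodeError:', exc.reason)

# 'replace' substitutes U+FFFD; 'ignore' drops the undecodable bytes.
print(raw.decode('ascii', 'replace'))  # prints the replacement character
print(raw.decode('ascii', 'ignore'))   # prints an empty line
```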
I know. I was trying really hard (but to no avail). :)

> So a possible implementation of your code could be:
>
> import locale
> loc, encoding = locale.getdefaultlocale()
> text[i] = text[i].decode(encoding, 'replace')[:maxcol-1] \
>     + u'…' \
>     + text[i].decode(encoding, 'replace')[maxcol:]
>
>> I think the problem might have something to do with the color codes.

I did try doing this (albeit with the second argument to unicode(); I'm
assuming that is basically the same). urwid's detected_encoding uses the
locale module as I mentioned above. This worked in the terminal in IPython,
but gave a UnicodeError when I put it in the PuDB code.

> That too. ANSI dates back from waaaay before talking with 'em foreeners
> with their strange alphabets ever was an issue, and as such, it makes a
> number of assumptions that are no longer true. In Spyrit (a project where
> I've got to parse bytes containing both ANSI and text of uncertain
> encoding) I deal with this by tokenizing the ANSI out as early as
> possible. Conversely, here, the proper approach would probably be to
> apply the ANSI formatting as late as possible, and generally after having
> turned your text back into bytes for the terminal's benefit.
>
> And this is where we veer into Not Worth It territory, so if you're not
> comfortable dealing with the whole Unicode-and-ANSI bytefest, I'd advise
> outright discarding the feature. Seriously, PuDB has been doing just
> great so far without the ellipsis character.
>
>> Anyway, I think unicode characters in urwid are just broken (at least
>> in Python 2; once the pudb Python 3 port is ready I'll try it there).
>
> According to the documentation: http://excess.org/urwid/wiki/TextEncodings
>
> ... urwid actually seems to do the right thing. :) (Although it does refer
> to bytestrings as 'normal strings', which is usually a dangerous mindset.)
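[Sundance's sketch above, rendered as runnable Python 3, where the decode happens via bytes.decode() and the locale supplies the default encoding. The names mark_overflow and maxcol are illustrative, not PuDB's actual API; it mirrors the thread's structure of overwriting the character at column maxcol-1 with '…' while keeping the tail:]

```python
import locale

def mark_overflow(raw, maxcol, encoding=None):
    """Decode raw bytes leniently, then replace the character at column
    maxcol-1 with '…', keeping the tail (as in the thread's sketch)."""
    if encoding is None:
        # Sane default: whatever the terminal's locale declares.
        encoding = locale.getpreferredencoding(False)
    text = raw.decode(encoding, 'replace')
    if len(text) <= maxcol:
        return text
    return text[:maxcol - 1] + '…' + text[maxcol:]

print(mark_overflow('très longue chaîne'.encode('utf-8'), 8, 'utf-8'))
```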
>
> Apparently, Pygments also does the right thing: its text tokens are always
> Unicode, and its formatters take an 'encoding' argument for proper
> conversion into bytes.
>
> So this might actually be a bug in PuDB. I'll take a dive into the source
> if I have time, and see what gives.

That would be great. I've given up trying to make this work.

Aaron Meurer

>
>> So I think getting the unicode … to work at the end of the variables
>> list is a losing battle, at least in Python 2.
>
> Agreed. It would have been cool, but just not cool enough to offset the
> pain of implementing it.
>
> Unicode is the number one reason why I can't wait for Python 3 to become
> standard. Soon, soon. :)
>
> Bye guys,
>
> -- S.

_______________________________________________
Pudb mailing list
[email protected]
http://lists.tiker.net/listinfo/pudb
