On Mon, 04 Oct 2010 12:32:26 -0400, Scott Dial <scott+python-...@scottdial.com> wrote:
> On 10/2/2010 7:00 PM, R. David Murray wrote:
> > The clever hack (thanks ultimately to Martin) is to accept 8bit data
> > by encoding it using the ASCII codec and the surrogateescape error
> > handler.
>
> I've seen this idea pop up in a number of threads. I worry that you are
> all inventing a new kind of dual that is a direct parallel to Python 2.x
> strings.

Yes, that is exactly my worry.

> That is to say,
>
> 3.x>>> b = b'\xc2\xa1'
> 3.x>>> s = b.decode('utf8')
> 3.x>>> v = b.decode('ascii', 'surrogateescape')
>
> , where s and v should be the same "thing" in 3.x but they are not due
> to an encoding trick.

Why "should" they be the same thing in 3.x? One is an ASCII string with
some escaped bytes in an unknown encoding, the other is a valid unicode
string. The surrogateescape trick is used only when we don't *know* the
encoding (a priori) of the bytes in question.

> I believe this trick generates more-or-less the same issues as strings
> did in 2.x:
>
> 2.x>>> b = '\xc2\xa1'
> 2.x>>> s = b.decode('utf8')
> 2.x>>> v = b

The difference is that in 2.x people could and would operate on strings
as if they knew the encoding, and get in trouble. In 3.x you can't do
that. If you've got escaped bytes you *know* that you don't know the
encoding, and the program can't get around that except by re-encoding to
bytes and properly decoding them.

> Any reasonable 2.x code has to guard on str/unicode and it would seem in
> 3.x, if this idiom spreads, reasonable code will have to guard on
> surrogate escapes (which actually seems like a more expensive test). As in,
>
> 3.x>>> print(v)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc2' in
> position 0: surrogates not allowed

Right, I mentioned that concern in my post. In this case at least,
however, the *goal* is that the surrogates are never seen outside the
email internals. In reflection of this, my latest thought is that I
should add a 'message_from_binary_file' helper method and a 'feedbytes'
method to feedparser, making the surrogates a 100% internal
implementation detail[*]. Only if the email package contains a coding
error would the surrogates escape and cause problems for user code.
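
For readers following along, the distinction drawn above can be sketched
in a few lines of Python 3. This is only an illustration, not code from
the email package: has_escaped_bytes is a hypothetical helper name, and
the choice of UTF-8 as the "real" encoding is an assumption made for the
example. The U+DC80..U+DCFF range is where the surrogateescape handler
places undecodable bytes 0x80..0xFF.

```python
raw = b'\xc2\xa1'

# Encoding unknown: decode as ASCII and let surrogateescape smuggle the
# non-ASCII bytes through as lone surrogates (0xC2 -> U+DCC2, 0xA1 -> U+DCA1).
escaped = raw.decode('ascii', 'surrogateescape')


def has_escaped_bytes(s):
    """Hypothetical guard: True if s carries surrogateescape'd bytes."""
    return any('\udc80' <= ch <= '\udcff' for ch in s)


# The only way forward is to go back to bytes, which is lossless...
recovered = escaped.encode('ascii', 'surrogateescape')

# ...and then decode properly once the encoding is known
# (assumed to be UTF-8 here purely for illustration).
text = recovered.decode('utf-8')

print(has_escaped_bytes(escaped))  # the string with escaped bytes
print(has_escaped_bytes(text))     # the properly decoded string
print(recovered == raw)
```

Note that the guard has to scan the whole string, so it costs time
proportional to the string's length each time it is applied.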

> It seems like this hack is about making the 3.x unicode type more like
> the 2.x string type, and I thought we decided that was a bad idea. How
> will developers not have to ask themselves whether a given string is a
> "real" string or a byte sequence masquerading as a string? Am I missing
> something here?

I think this question is something that needs to be considered any time
using surrogates is proposed. I hope that in the email package proposal
I've addressed it. What do you think?

--David

[*] And you are right that there is a performance concern as a result of
needing to detect surrogates at various points in the code.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev