On Thu, Jun 24, 2010 at 12:38 PM, Bill Janssen <jans...@parc.com> wrote:
> Here are a couple of ideas I'm taking away from the bytes/string
> discussion.
>
> First, it would probably be a good idea to have a String ABC.
>
> Secondly, maybe the string situation in 2.x wasn't as broken as we
> thought it was. In particular, those who deal with lots of encoded
> strings seemed to find it handy, and miss it in 3.x. Perhaps strings
> are more like numbers than we think. We have separate types for int,
> float, Decimal, etc. But they're all numbers, and they all
> cross-operate. In 2.x, it seems there were two missing features: no
> encoding attribute on str, which should have been there and should have
> been required, and the default encoding being "ASCII" (I can't tell you
> how many times I've had to fix that issue when a non-ASCII encoded str
> was passed to some output function).
>

I've started to form a conceptual notion that I think fits these cases.
We've set up a system where we think of text as natively unicode, with
encodings to put that unicode into a byte form. This is certainly
appropriate in a lot of cases. But there's a significant class of
problems where bytes are the native structure. Network protocols are
what we've been discussing, and are a notable case of that. That is,
b'/' is the most native sense of a path separator in a URL, or b':' is
the most native sense of what separates a header name from a header
value in HTTP. To disallow unicode URLs or unicode HTTP headers would
be rather anti-social, especially because unicode is now the "native"
string type in Python 3 (as an aside, for the WSGI spec we've been
talking about using "native" strings in some positions like dictionary
keys, meaning Python 2 str and Python 3 str, while being more exacting
in other areas such as a response body, which would always be bytes).
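
To illustrate the point about separators being tied to one type, here is a
minimal sketch in Python 3. The helper name split_header is hypothetical and
is not taken from any library; it only shows that code meant to accept both
bytes and str has to pick a separator of the matching type, because mixing
the two raises TypeError.

```python
def split_header(line):
    """Split one header line into (name, value), for bytes or str input."""
    # The separator has to match the type of the input, so choose it explicitly.
    sep = b":" if isinstance(line, bytes) else ":"
    name, _, value = line.partition(sep)
    return name.strip(), value.strip()


# The same function works for both types, but only because of the type check.
assert split_header(b"Host: example.org") == (b"Host", b"example.org")
assert split_header("Host: example.org") == ("Host", "example.org")

# Using a str separator on bytes fails immediately in Python 3.
try:
    b"Host: example.org".partition(":")
except TypeError:
    mixed_types_rejected = True
else:
    mixed_types_rejected = False
assert mixed_types_rejected
```
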
The HTTP spec and other network protocols seem a little fuzzy on this,
because they were written before unicode even existed, and even later
activity happened at a point when "unicode" and "text" weren't widely
considered the same thing like they are now. But I think the original
intention is revealed in a more modern specification like WebSockets,
where they are very explicit that ':' is just shorthand for a
particular byte; it is not "text" in our new modern notion of the term.

So with this idea in mind it makes more sense to me that *specific
pieces of text* can be reasonably treated as both bytes and text: all
the string literals in urllib.parse.urlunsplit(), for example.

The semantics I imagine are that special('/')+b'x'==b'/x' (i.e., it
does not become special('/x')) and special('/')+'x'=='/x' (again it
becomes str). This avoids some of the cases of unicode or str
infecting a system as they did in Python 2 (where you might pass in
unicode and everything works fine until some non-ASCII is introduced).

The one place where this might be tricky is if you have an encoding
that is not ASCII compatible. But we can't guard against every
possibility. So it would be entirely wrong to take a string encoded
with UTF-16 and start to use b'/' with it. But there are other
nonsensical combinations already possible, especially with polymorphic
functions; we can't guard against all of them. Also I'm unsure if
something like UTF-16 is in any way compatible with the kind of legacy
systems that use bytes. Can you encode your filesystem with UTF-16? I
don't think you could encode a cookie with it.

> So maybe having a second string type in 3.x that consists of an encoded
> sequence of bytes plus the encoding, call it "estr", wouldn't have been
> a bad idea. It would probably have made sense to have estr cooperate
> with the str type, in the same way that two different kinds of numbers
> cooperate, "promoting" the result of an operation only when necessary.
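
The special('/') semantics described above can be roughly sketched in pure
Python. This is only an illustration under assumptions: the class name
Special, the ASCII-only restriction, and the use of __add__/__radd__ are
choices made here, not an existing API, and the sketch covers concatenation
only.

```python
class Special:
    """Hypothetical ASCII-only literal that takes on the type it is added to."""

    def __init__(self, text):
        # Assumption: restrict to ASCII so the bytes and str forms always agree.
        text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
        self._text = text

    def _as_type_of(self, other):
        # Return this literal in the same type as the other operand.
        if isinstance(other, bytes):
            return self._text.encode("ascii")
        if isinstance(other, str):
            return self._text
        return None

    def __add__(self, other):
        mine = self._as_type_of(other)
        if mine is None:
            return NotImplemented
        return mine + other

    def __radd__(self, other):
        mine = self._as_type_of(other)
        if mine is None:
            return NotImplemented
        return other + mine


sep = Special("/")
assert sep + b"x" == b"/x"   # result is plain bytes, not Special
assert sep + "x" == "/x"     # result is plain str
assert b"a" + sep == b"a/"   # works on the right-hand side as well
```

Because the result is always a plain bytes or str object, the special value
does not propagate through later operations.
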
> This would automatically achieve the kind of polymorphic functionality
> that Guido is suggesting, but without losing the ability to do
>
>   x = e(ASCII)"bar"
>   a = ''.join("foo", x)
>
> (or whatever the syntax for such an encoded string literal would be --
> I'm not claiming this is a good one) which presume would bind "a" to a
> Unicode string "foobar" -- have to work out what gets promoted to what.
>

I would be entirely happy without a literal syntax. But as Phillip has
noted, this can't be implemented *entirely* in a library, as there are
some constraints with the current str/bytes implementations. Reading
PEP 3003, I'm not clear whether such changes are covered by the
moratorium. They seem like they would be (sadly), but it doesn't seem
clearly noted.

I think there's a *different* use case for things like
bytes-in-a-utf8-encoding (e.g., to allow XML data to be decoded
lazily), but that could be yet another class, and maybe shouldn't be
polymorphically usable as bytes (i.e., treat it as an optimized str
representation that is otherwise semantically equivalent). A String
ABC would formalize these things.

-- 
Ian Bicking | http://blog.ianbicking.org
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev