Re: [Python-Dev] thoughts on the bytes/string discussion

Ian Bicking Thu, 24 Jun 2010 12:51:45 -0700

On Thu, Jun 24, 2010 at 12:38 PM, Bill Janssen <jans...@parc.com> wrote:


> Here are a couple of ideas I'm taking away from the bytes/string
> discussion.
>
> First, it would probably be a good idea to have a String ABC.
>
> Secondly, maybe the string situation in 2.x wasn't as broken as we
> thought it was.  In particular, those who deal with lots of encoded
> strings seemed to find it handy, and miss it in 3.x.  Perhaps strings
> are more like numbers than we think.  We have separate types for int,
> float, Decimal, etc.  But they're all numbers, and they all
> cross-operate.  In 2.x, it seems there were two missing features: no
> encoding attribute on str, which should have been there and should have
> been required, and the default encoding being "ASCII" (I can't tell you
> how many times I've had to fix that issue when a non-ASCII encoded str
> was passed to some output function).
>

I've started to form a conceptual notion that I think fits these cases.

We've set up a system where we think of text as natively unicode, with
encodings to put that unicode into a byte form.  This is certainly
appropriate in a lot of cases.  But there's a significant class of problems
where bytes are the native structure.  Network protocols are what we've been
discussing, and are a notable case of that.  That is, b'/' is the most
native sense of a path separator in a URL, or b':' is the most native sense
of what separates a header name from a header value in HTTP.  To disallow
unicode URLs or unicode HTTP headers would be rather anti-social, especially
because unicode is now the "native" string type in Python 3 (as an aside for
the WSGI spec we've been talking about using "native" strings in some
positions like dictionary keys, meaning Python 2 str and Python 3 str, while
being more exacting in other areas such as a response body which would
always be bytes).

The HTTP spec and other network protocols seem a little fuzzy on this,
because they were written before unicode even existed, and even later activity
happened at a point when "unicode" and "text" weren't widely considered the
same thing like they are now.  But I think the original intention is
revealed in a more modern specification like WebSockets, where they are very
explicit that ':' is just shorthand for a particular byte, it is not "text"
in our new modern notion of the term.

So with this idea in mind it makes more sense to me that *specific pieces of
text* can be reasonably treated as both bytes and text: all the string
literals in urllib.parse.urlunsplit(), for example.

The semantics I imagine are that special('/')+b'x'==b'/x' (i.e., it does not
become special('/x')) and special('/')+'x'=='/x' (here it becomes str).  This
avoids some of the cases of unicode or str infecting a system as they did in
Python 2 (where you might pass in unicode and everything works fine until
some non-ASCII is introduced).
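
A rough sketch of those semantics, for illustration only.  Only the name
special comes from the description above; the class body, the helper
_as_type_of and the ASCII-only restriction are assumptions of this sketch,
not an existing API.

```python
class special:
    """Hypothetical ASCII-only fragment that takes on its neighbor's type."""

    def __init__(self, text):
        # Assumption: restricting to ASCII keeps the bytes and str forms
        # unambiguous.  Raises UnicodeEncodeError for anything else.
        text.encode("ascii")
        self._text = text

    def _as_type_of(self, other):
        # Pick the representation that matches the other operand.
        if isinstance(other, bytes):
            return self._text.encode("ascii")
        if isinstance(other, str):
            return self._text
        return None

    def __add__(self, other):
        mine = self._as_type_of(other)
        if mine is None:
            return NotImplemented
        # The result is plain bytes or plain str, never another special.
        return mine + other

    def __radd__(self, other):
        mine = self._as_type_of(other)
        if mine is None:
            return NotImplemented
        return other + mine


sep = special("/")
print(sep + b"x")          # b'/x'
print(sep + "x")           # /x
print(b"a" + sep + b"b")   # b'a/b'
```

Because the result is always the other operand's type, the special value
never spreads through the rest of the system.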

The one place where this might be tricky is if you have an encoding that is
not ASCII compatible.  It would be entirely wrong to take a string encoded
with UTF-16 and start to use b'/' with it.  But other nonsensical
combinations are already possible, especially with polymorphic functions,
and we can't guard against all of them.  Also I'm unsure if something like
UTF-16 is in any way compatible with the kind of legacy systems that use
bytes.  Can you encode your filesystem with UTF-16?  I don't think you could
encode a cookie with it.
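
A small illustration of why b'/' and UTF-16 don't mix, using only the
standard codecs.  The sample path and the helper decodes_cleanly are made up
for the example.

```python
path = "a/b"

# ASCII-compatible encoding: the byte 0x2F can only ever mean '/'.
utf8_parts = path.encode("utf-8").split(b"/")
print(utf8_parts)  # [b'a', b'b']

# UTF-16 uses two bytes per code unit, so splitting on the single byte
# b'/' cuts a code unit in half.
utf16_parts = path.encode("utf-16-le").split(b"/")
print(utf16_parts)  # [b'a\x00', b'\x00b\x00']


def decodes_cleanly(data, encoding):
    """Return True if data is a complete, valid sequence in encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print(all(decodes_cleanly(p, "utf-16-le") for p in utf16_parts))  # False

# The byte 0x2F also shows up inside unrelated characters, e.g. U+012F.
print(b"/" in "\u012f".encode("utf-16-le"))  # True
```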

> So maybe having a second string type in 3.x that consists of an encoded
> sequence of bytes plus the encoding, call it "estr", wouldn't have been
> a bad idea.  It would probably have made sense to have estr cooperate
> with the str type, in the same way that two different kinds of numbers
> cooperate, "promoting" the result of an operation only when necessary.
> This would automatically achieve the kind of polymorphic functionality
> that Guido is suggesting, but without losing the ability to do
>
>  x = e(ASCII)"bar"
>  a = ''.join(["foo", x])
>
> (or whatever the syntax for such an encoded string literal would be --
> I'm not claiming this is a good one) which I presume would bind "a" to a
> Unicode string "foobar" -- have to work out what gets promoted to what.
>
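
One way the promotion idea could look as a plain class, as a sketch only.
The name estr comes from the quoted text; the constructor signature, the
promotion rules, and the assumption that same-encoding values can be
concatenated byte-wise (true for ASCII, UTF-8 or Latin-1, not for encodings
with a BOM or shift state) are illustrative choices.

```python
import codecs


class estr:
    """Hypothetical 'encoded string': bytes plus the encoding they are in."""

    def __init__(self, data, encoding):
        # Normalize the codec name so "ASCII" and "ascii" compare equal.
        self.encoding = codecs.lookup(encoding).name
        # Validate early; raises UnicodeDecodeError on bad input.
        data.decode(self.encoding)
        self.data = data

    def __str__(self):
        return self.data.decode(self.encoding)

    def __add__(self, other):
        if isinstance(other, estr):
            if other.encoding == self.encoding:
                # Same encoding: no promotion needed.
                return estr(self.data + other.data, self.encoding)
            # Mixed encodings: promote both sides to str.
            return str(self) + str(other)
        if isinstance(other, str):
            return str(self) + other
        return NotImplemented

    def __radd__(self, other):
        if isinstance(other, str):
            return other + str(self)
        return NotImplemented


x = estr(b"bar", "ASCII")
a = "foo" + x
print(a)  # foobar
```

Note that ''.join(["foo", x]) would still raise TypeError with this class,
since str.join only accepts str items.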

I would be entirely happy without a literal syntax.  But as Phillip has
noted, this can't be implemented *entirely* in a library as there are some
constraints with the current str/bytes implementations.  Reading PEP 3003,
I'm not clear whether such changes are part of the moratorium.  They seem
like they would be (sadly), but it doesn't seem clearly noted.

I think there's a *different* use case for things like
bytes-in-a-utf8-encoding (e.g., to allow XML data to be decoded lazily), but
that could be yet another class, and maybe shouldn't be polymorphically usable
as bytes (i.e., treat it as an optimized str representation that is
otherwise semantically equivalent).  A String ABC would formalize these
things.
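
A sketch of how such a class and a String ABC might fit together.  The
names String and LazyUTF8 are hypothetical, and the set of methods shown is
just enough to demonstrate lazy decoding; a real ABC would have to specify
far more of the str interface.

```python
from abc import ABC


class String(ABC):
    """Hypothetical ABC for types that are semantically text."""


# str is text; bytes is deliberately not registered.
String.register(str)


class LazyUTF8(String):
    """Hypothetical str-like wrapper that decodes UTF-8 bytes on first use."""

    def __init__(self, data):
        self._data = data
        self._text = None  # filled in on first access

    def __str__(self):
        if self._text is None:
            self._text = self._data.decode("utf-8")
        return self._text

    def __len__(self):
        # Length in characters, as for str, not in bytes.
        return len(str(self))

    def __eq__(self, other):
        if isinstance(other, (str, LazyUTF8)):
            return str(self) == str(other)
        return NotImplemented

    def __hash__(self):
        return hash(str(self))


doc = LazyUTF8("caf\u00e9".encode("utf-8"))
print(isinstance(doc, String), isinstance(b"x", String))  # True False
print(len(doc), doc == "caf\u00e9")                       # 4 True
```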

-- 
Ian Bicking  |  http://blog.ianbicking.org

