On Oct 26, 2019, at 16:28, Steven D'Aprano <st...@pearwood.info> wrote: > >> On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas >> wrote: >> On Oct 13, 2019, at 12:02, Steve Jorgensen <ste...@stevej.name> wrote: > [...] >>> This proposal is a serious breakage of backward compatibility, so >>> would be something for Python 4.x, not 3.x. >> >> I’m pretty sure almost nobody wants a 3.0-like break again, so this >> will probably never happen. > > Indeed, and Guido did rule some time ago that 4.0 would be ordinary > transition, like 3.7 to 3.8, not a big backwards breaking version > change.
That _could_ change, especially if 3.9 is followed by 3.10 (or has that already been rejected?). But I think almost everyone agrees with Guido, and that’ll probably be true until the memory of 2.7 fades (a few years after Apple stops shipping it and the last Linux distros go out of LTS). I guess your 5000 implies about 16 years off, so… ok. But at that point, it makes as much sense to talk about a hypothetical new Python-like language. >> And finally, if you want to break strings, it’s probably worth at >> least considering making UTF-8 strings first-class objects. They can’t >> be randomly accessed, > > I don't see why you can't make arrays of UTF-8 indexable and provide > random access to any code point. I understand that ``str`` in > Micropython is implemented that way. Most of the time, you really don’t need random access to strings—except in the case where you got that integer index back from a the find method or a regex match object or something, in which case using Swift-style non-integer indexes, or Rust-style (and Python file object seek/tell) byte offsets, solves the problem just as well. But when you do want it, it’s very likely you don’t want it to take linear time. Providing indexing, but having it be unacceptably slow for anything but small strings, isn’t providing a useful feature, it’s providing a cruel tease. Logarithmic time is probably acceptable, but building that index takes linear time, so now constructing strings becomes slow, which is even worse (especially since it affects even strings you were never going to randomly access). > But why would you want an explicit UTF-8 string object? What benefit > do you get from exposing the fact that the implementation happens to be > UTF-8 rather than something else? (Not rhetorical questions.) For novices who only deal with UTF-8, it might mean never having to call encode or decode again. But the real benefit is to enable low-level code (that in turn makes high-level code easier to write). Have you ever written code that mmaps a text file and processes it as text? You either have to treat it as bytes and not do proper Unicode (which works for some string operations—until the first time you get some data where it doesn’t), or implement all the Unicode algorithms yourself (especially fun if what you’re trying to do is, say, a regex search), or put a buffer in front of it and decode on the fly, defeating the whole point of mmap. Have you ever read an HTTP header as bytes to verify that it’s UTF-8 and then tried to switch to using the same socket connection as a text file object rather than binary? It’s doable, but it’s a pain. And the reason all of this is a pain is that when Python (and Java and Ruby and so on) added Unicode support, the idea of assuming most files and protocols and streams are UTF-8 was ridiculous. Making UTF-8 a little easier to deal with by making everything else either slower or harder to deal with was a terrible trade off then. But in 2019—much less in Python 5000-land—that’s no longer true. > If the UTF-8 object operates on the basis of Unicode code points, then > its just a str, and the implementation is just an implementation detail. Ideally, it can iterate any of code units (bytes), code points, or grapheme clusters, not just one. Because they’re all useful at different times. But most of the string methods would be in terms of grapheme clusters. > If the UTF-8 object operates on the basis of raw bytes, with no > protection against malformed UTF-8 (e.g. allowing you to insert bytes > 0x80-0xFF which are never valid in UTF-8, or by splitting apart a two- > or three-byte UTF-8 sequence) then its just a bytes object (or > bytearray) initialised with a UTF-8 sequence. What’s this about inserting bytes? I’m not suggesting making strings mutable; that’s insane even for 5.0. :) Anyway, it’s just a bytes object with all of the string methods, and that duck types as a string for all third-party string functions and so on, which is a lot different than “just a bytes object”. But a much better way to see it is that it’s a str object that also offers direct access to its UTF-8 bytes. Which you don’t usually need, but it is sometimes useful. And it would be more useful if things like sockets and pipes and so on had UTF-8 modes where they could just send UTF-8 strings, without you having to manually wrap them in a TextIOWrapper with non-default args first. This would require lots of changes to the stdlib and to tons of existing third-party code, to the extent that I’m not sure even “Python 5000” makes it ok, but for a new Python-inspired language, that’s a different story… > That is, as I understand it, what languages like Go do. To paraphrase, > they offer data types they *call* UTF-8 strings, except that they can > contain arbitrary bytes and be invalid UTF-8. We can already do this, > today, without the deeply misleading name: > > string.encode('utf-8') > > and then work with the bytes. I think this is even quite efficient in > CPython's "Flexible string representation". For ASCII-only strings, the > UTF-8 encoding uses the same storage as the original ASCII bytes. For > others, the UTF-8 representation is cached for later use. We had to decode it from UTF-8 and encode it back. Sure, it gets cached so we don’t have to keep doing that over and over. But leaving it as UTF-8 in the first place means we don’t have to do it at all. Of course this is only true if the source literal or text file or API or network protocol or whatever was encoded in UTF-8. But most of them are. (For the rest, yes, we still have to decode from UTF-16-LE or Shift-JIS or cp1252 or whatever and re-encode as UTF-8—albeit with a minor shortcut for the first example. But that’s no worse than today, and it’s getting less common all the time anyway.) > So I don't see any advantage to this UTF-8 object. If the API works on > code points, then it's just an implementation detail of str; if the API > works on code units, that's just a fancy name for bytes. We already have > both str and bytes so what is the purpose of this utf8 object? Since we’re now talking 5000 rather than 4000, this could replace str rather than be in addition to it. And it would also replace many uses of bytes. People would still need bytes when they want a raw buffer of something that isn’t text, and when they want a buffer of something that’s not known to be UTF-8 (like the HTTP example–you start with bytes, then switch to utf8 once you know the encoding is utf8 or stick a stream decoder in front of it if it turns out not to be), but when you want a buffer of encoded text, the string is the buffer. _______________________________________________ Python-ideas mailing list -- python-ideas@python.org To unsubscribe send an email to python-ideas-le...@python.org https://mail.python.org/mailman3/lists/python-ideas.python.org/ Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/DMLQGUCNFVM4SFRRBFOYOQVXYF5NH3EL/ Code of Conduct: http://python.org/psf/codeofconduct/