On Oct 26, 2019, at 16:28, Steven D'Aprano <st...@pearwood.info> wrote:
> 
> On Sun, Oct 13, 2019 at 12:41:55PM -0700, Andrew Barnert via Python-ideas 
> wrote:
>> On Oct 13, 2019, at 12:02, Steve Jorgensen <ste...@stevej.name> wrote:
> [...]
>>> This proposal is a serious breakage of backward compatibility, so 
>>> would be something for Python 4.x, not 3.x.
>> 
>> I’m pretty sure almost nobody wants a 3.0-like break again, so this 
>> will probably never happen.
> 
> Indeed, and Guido did rule some time ago that 4.0 would be an ordinary 
> transition, like 3.7 to 3.8, not a big backwards-breaking version 
> change.

That _could_ change, especially if 3.9 is followed by 3.10 (or has that already 
been rejected?). But I think almost everyone agrees with Guido, and that’ll 
probably be true until the memory of 2.7 fades (a few years after Apple stops 
shipping it and the last Linux distros go out of LTS). I guess your “Python 
5000” implies that’s about 16 years off, so… OK. But at that point, it makes as 
much sense to talk about a hypothetical new Python-like language.

>> And finally, if you want to break strings, it’s probably worth at 
>> least considering making UTF-8 strings first-class objects. They can’t 
>> be randomly accessed, 
> 
> I don't see why you can't make arrays of UTF-8 indexable and provide 
> random access to any code point. I understand that ``str`` in 
> Micropython is implemented that way.

Most of the time, you really don’t need random access to strings—except in the 
case where you got that integer index back from the find method or a regex 
match object or something, in which case using Swift-style non-integer indexes, 
or Rust-style (and Python file object seek/tell) byte offsets, solves the 
problem just as well.

But when you do want it, it’s very likely you don’t want it to take linear 
time. Providing indexing, but having it be unacceptably slow for anything but 
small strings, isn’t providing a useful feature, it’s providing a cruel tease. 
Logarithmic time is probably acceptable, but building that index takes linear 
time, so now constructing strings becomes slow, which is even worse (especially 
since it affects even strings you were never going to randomly access).
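
To make the cost concrete, here is a rough sketch (my own illustration, not 
MicroPython’s actual code) of what indexing a raw UTF-8 buffer by code point 
has to do: walk every lead byte from the start.

    def codepoint_at(buf: bytes, index: int) -> str:
        """Return the index-th code point of UTF-8 bytes, in O(len(buf)) time."""
        count = 0
        for pos, byte in enumerate(buf):
            if byte & 0xC0 != 0x80:          # a lead byte starts a new code point
                if count == index:
                    end = pos + 1
                    while end < len(buf) and buf[end] & 0xC0 == 0x80:
                        end += 1             # swallow the continuation bytes
                    return buf[pos:end].decode("utf-8")
                count += 1
        raise IndexError(index)

    print(codepoint_at("naïve".encode("utf-8"), 2))  # ï, found by scanning from byte 0

An auxiliary index over the lead-byte positions buys the faster lookups, but 
building it is exactly the linear pass over every string I’m worrying about 
above.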

> But why would you want an explicit UTF-8 string object? What benefit 
> do you get from exposing the fact that the implementation happens to be 
> UTF-8 rather than something else? (Not rhetorical questions.)

For novices who only deal with UTF-8, it might mean never having to call encode 
or decode again. But the real benefit is to enable low-level code (that in turn 
makes high-level code easier to write).

Have you ever written code that mmaps a text file and processes it as text? You 
either have to treat it as bytes and not do proper Unicode (which works for 
some string operations—until the first time you get some data where it 
doesn’t), or implement all the Unicode algorithms yourself (especially fun if 
what you’re trying to do is, say, a regex search), or put a buffer in front of 
it and decode on the fly, defeating the whole point of mmap.
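
Here’s a minimal sketch of the bytes-level workaround (the file name is made 
up, and the pattern has to be pre-encoded; byte-by-byte matching means no 
Unicode case folding, no \w across scripts, and so on):

    import mmap
    import re

    # Search a memory-mapped file with a bytes regex; nothing is decoded,
    # so the mapping is never copied, but the match is not Unicode-aware.
    with open("big.log", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in re.finditer("naïve".encode("utf-8"), mm):
                print(m.start(), m.group().decode("utf-8"))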

Have you ever read an HTTP header as bytes to verify that it’s UTF-8 and then 
tried to switch to using the same socket connection as a text file object 
rather than binary? It’s doable, but it’s a pain.
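
The dance looks roughly like this (host and request hard-coded for 
illustration; real code would parse the charset out of the headers before 
switching):

    import io
    import socket

    sock = socket.create_connection(("example.com", 80))
    sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\n"
                 b"Connection: close\r\n\r\n")
    raw = sock.makefile("rb")                # buffered binary file over the socket
    while raw.readline() not in (b"\r\n", b"\n", b""):
        pass                                 # consume the headers as bytes
    body = io.TextIOWrapper(raw, encoding="utf-8")  # same stream, now text
    print(body.read()[:100])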

And the reason all of this is a pain is that when Python (and Java and Ruby and 
so on) added Unicode support, the idea of assuming most files and protocols and 
streams are UTF-8 was ridiculous. Making UTF-8 a little easier to deal with by 
making everything else either slower or harder to deal with was a terrible 
trade-off then. But in 2019—much less in Python 5000-land—that’s no longer true.

> If the UTF-8 object operates on the basis of Unicode code points, then 
> it's just a str, and the implementation is just an implementation detail.

Ideally, it can iterate any of code units (bytes), code points, or grapheme 
clusters, not just one. Because they’re all useful at different times. But most 
of the string methods would be in terms of grapheme clusters.
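
With today’s types, the first two views are easy and the third is not (the 
"grapheme" package mentioned in the comments is third-party):

    s = "e\u0301"                    # "é" as a base letter plus a combining accent
    print(list(s.encode("utf-8")))   # code units (bytes): [101, 204, 129]
    print(list(s))                   # code points: the letter and the combining mark
    # Grapheme clusters are what a user perceives as single characters; the
    # stdlib has no UAX #29 segmentation, but the third-party "grapheme"
    # package is one implementation:
    #     import grapheme
    #     list(grapheme.graphemes(s))  # -> ['é']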

> If the UTF-8 object operates on the basis of raw bytes, with no 
> protection against malformed UTF-8 (e.g. allowing you to insert bytes 
> 0x80-0xFF which are never valid in UTF-8, or by splitting apart a two- 
> or three-byte UTF-8 sequence) then it's just a bytes object (or 
> bytearray) initialised with a UTF-8 sequence.

What’s this about inserting bytes? I’m not suggesting making strings mutable; 
that’s insane even for 5.0. :)
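
As for invalid sequences: a bytes object happily holds them, and a utf8 type 
would simply reject them at construction, the same way decode rejects them 
today:

    bad = b"abc\x80def"      # a stray continuation byte is never valid UTF-8
    try:
        bad.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)             # a utf8 type would raise the same way when built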

Anyway, it’s just a bytes object with all of the string methods, one that duck 
types as a string for all third-party string functions and so on, which is a 
lot different from “just a bytes object”.

But a much better way to see it is that it’s a str object that also offers 
direct access to its UTF-8 bytes. Which you don’t usually need, but it is 
sometimes useful. And it would be more useful if things like sockets and pipes 
and so on had UTF-8 modes where they could just send UTF-8 strings, without you 
having to manually wrap them in a TextIOWrapper with non-default args first.
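
As a toy sketch of that framing (the class name and attribute are invented, and 
keeping a second copy of the data is illustration-only; a real implementation 
would share storage at the C level):

    class utf8str(str):
        """A str that keeps, and exposes, its UTF-8 bytes."""
        def __new__(cls, data):
            if isinstance(data, (bytes, bytearray)):
                self = super().__new__(cls, bytes(data).decode("utf-8"))
                self._utf8 = bytes(data)      # keep the buffer we were built from
            else:
                self = super().__new__(cls, data)
                self._utf8 = str.encode(self, "utf-8")
            return self

        @property
        def utf8(self):
            return self._utf8                 # direct access, no re-encoding

    s = utf8str("naïve")
    print(s.upper(), s.utf8)   # every str method works; the bytes are one attribute away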

This would require lots of changes to the stdlib and to tons of existing 
third-party code, to the extent that I’m not sure even “Python 5000” makes it 
ok, but for a new Python-inspired language, that’s a different story…

> That is, as I understand it, what languages like Go do. To paraphrase, 
> they offer data types they *call* UTF-8 strings, except that they can 
> contain arbitrary bytes and be invalid UTF-8. We can already do this, 
> today, without the deeply misleading name:
> 
>    string.encode('utf-8')
> 
> and then work with the bytes. I think this is even quite efficient in 
> CPython's "Flexible string representation". For ASCII-only strings, the 
> UTF-8 encoding uses the same storage as the original ASCII bytes. For 
> others, the UTF-8 representation is cached for later use.

We had to decode it from UTF-8 and encode it back. Sure, it gets cached so we 
don’t have to keep doing that over and over. But leaving it as UTF-8 in the 
first place means we don’t have to do it at all.
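
Spelled out, today’s flow is a round trip (the file name is made up):

    with open("doc.txt", "rb") as f:
        data = f.read()                  # raw UTF-8 bytes from disk
    text = data.decode("utf-8")          # linear-time decode into new storage
    wire = text.encode("utf-8")          # linear-time encode (then cached)...
    assert wire == data                  # ...to recover the bytes we started with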

Of course this is only true if the source literal or text file or API or 
network protocol or whatever was encoded in UTF-8. But most of them are. (For 
the rest, yes, we still have to decode from UTF-16-LE or Shift-JIS or cp1252 or 
whatever and re-encode as UTF-8—albeit with a minor shortcut for the first 
example. But that’s no worse than today, and it’s getting less common all the 
time anyway.)

> So I don't see any advantage to this UTF-8 object. If the API works on
> code points, then it's just an implementation detail of str; if the API 
> works on code units, that's just a fancy name for bytes. We already have 
> both str and bytes so what is the purpose of this utf8 object?

Since we’re now talking 5000 rather than 4000, this could replace str rather 
than be in addition to it. 

And it would also replace many uses of bytes. People would still need bytes 
when they want a raw buffer of something that isn’t text, or a buffer of 
something that’s not known to be UTF-8 (like the HTTP example: you start with 
bytes, then switch to utf8 once you know the encoding is UTF-8, or stick a 
stream decoder in front of it if it turns out not to be). But when you want a 
buffer of encoded text, the string is the buffer.
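
To spell out that division of labor, a sketch with invented names (utf8 as the 
string type; the helper functions are placeholders, not real APIs):

    headers = read_headers(sock)           # bytes: the encoding isn't known yet
    if charset_from(headers) == "utf-8":
        body = utf8(read_body(sock))       # validate once; the buffer is the string
    else:
        # Non-UTF-8 data still needs a real decoding step, as today.
        body = decode_stream(sock, charset_from(headers))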