On Oct 25, 2019, at 20:44, Ben Rudiak-Gould <benrud...@gmail.com> wrote:
> 
> Nothing good can come of decomposing strings into Unicode code points.

Which is why Unicode defines extended grapheme clusters, which are the 
intended approximation of human-language characters, and which can be made up 
of multiple code points. With clusters, you don’t have to worry about 
combining marks, code units, normalization, and so on.
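To see why that matters, here is what today's Python (which exposes code points, not clusters) does with a single accented letter; the string literal is just an illustrative example:

```python
import unicodedata

s = "e\u0301"                          # "é" written as 'e' + combining acute
len(s)                                 # -> 2: two code points, one "character"
nfc = unicodedata.normalize("NFC", s)  # recompose into the precomposed form
len(nfc)                               # -> 1: a single code point, U+00E9
s == nfc                               # -> False, though they render identically
```

A cluster-based string type would report length 1 for both and compare them however its equality is defined, so none of this leaks through to the user.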

And it is definitely possible to design a language that deals with clusters 
efficiently. Swift does it, for example. And they treat code units as no more 
fundamental than the underlying code points. (And, as you’d expect, code points 
and UTF-8/16/32 code units are appropriately-sized integers, not chars.)

In fact, Go treats the code units as _less_ fundamental: all strings are stored 
as UTF-8, so you can access the code units just by converting to []byte, but if 
you want to iterate the code points (which are integers, of type rune), you 
have to import functions for that, or convert to a []rune (effectively UTF-32, 
and not a string), or use the special-cased for…range loop. And Rust is 
essentially the same (but with more low-level tools for writing your own sugar, 
instead of hardcoded magic).
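For contrast, a sketch of how the same two views look from today's Python, where the code points are the primary view and the UTF-8 code units are the thing you have to explicitly ask for:

```python
s = "héllo"

# Code points: what you get by iterating the str itself.
[ord(c) for c in s]          # -> [104, 233, 108, 108, 111]

# UTF-8 code units: you have to encode to get at them.
list(s.encode("utf-8"))      # -> [104, 195, 169, 108, 108, 111]
```

Note that "é" is one code point (233) but two UTF-8 code units (195, 169); Go's default iteration and Python's are on opposite sides of that split.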

> The code point abstraction is practically as low level as the internal
> byte encoding of the strings. Only lexing libraries should look at
> strings at that level, and you should use a well written and tested
> lexing library, not a hacky hand-coded lexer.

OK, but how do you write that lexer? Most people should just get it off PyPI, 
but someone has to write it and put it on PyPI, and it has to have access to 
either grapheme clusters, code units, or code points, or there’s no way it 
can lex anything. Unless you’re suggesting that the lexing library needs to be 
written in C (and then you’ll need different ones for Jython, etc.)?
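To make the point concrete, here is a deliberately minimal sketch of the kind of hand-written lexer in question (the token kinds and structure are invented for illustration, not any particular library's API). Notice that it needs to index and slice the string at _some_ granularity; here that granularity is code points, because that's what Python's str gives it:

```python
def lex(text):
    """Toy lexer: split text into (kind, lexeme) tokens."""
    tokens, i, n = [], 0, len(text)
    while i < n:
        ch = text[i]              # indexing by code point
        if ch.isspace():
            i += 1
        elif ch.isalpha():
            j = i
            while j < n and text[j].isalnum():
                j += 1
            tokens.append(("name", text[i:j]))
            i = j
        elif ch.isdigit():
            j = i
            while j < n and text[j].isdigit():
                j += 1
            tokens.append(("number", text[i:j]))
            i = j
        else:
            tokens.append(("punct", ch))
            i += 1
    return tokens

lex("foo = 42")  # -> [('name', 'foo'), ('punct', '='), ('number', '42')]
```

Take away the low-level view of the string and this code has nothing left to operate on.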

> Explicit access to code points should be ugly – s.__codepoints__,
> maybe. And that should be a sequence of integers, not strings like
> "́".

Sure, but as you argued, code points are almost never what you want. And 
clusters don’t have a fixed size, or integer indices; how would you represent 
them except as a char type?
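For instance, here is a very rough cluster split (combining marks only; real segmentation per UAX #29 also handles ZWJ emoji sequences, regional indicators, Hangul jamo, and more). The natural type for each cluster it yields is simply a substring:

```python
import unicodedata

def rough_clusters(s):
    """Crude approximation of grapheme clusters: attach combining
    marks to the preceding base character. Not full UAX #29."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch        # glue the mark onto the previous cluster
        else:
            out.append(ch)
    return out

rough_clusters("e\u0301a")  # -> ['e\u0301', 'a']: two clusters, three code points
```

Each element is a str of one or more code points, which is exactly the "characters are themselves strings" representation under discussion.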

>> it’s probably worth at least considering making UTF-8 strings first-class 
>> objects. They can’t be randomly accessed,
> 
> They can be randomly accessed by abstract indices: objects that look
> similar to ints from C code, but that have no extractable integer
> value in Python code, so that they're independent of the underlying
> string representation.

You certainly can design a more complicated iteration/indexing protocol that 
handles this—C++, Swift, and Rust have exactly that, and even Python text files 
with seek and tell are roughly equivalent—but it’s explicitly not random 
access. You can only jump to a position you’ve already iterated to.

For example, in Swift, you can’t write `s[..<20]`, but if you write 
`s.firstIndex(of: ",")`, what you get back isn’t a number, it’s a String.Index, 
and you can use that in `s[..<commaIndex]`. And a String.Index is not a 
random-access index, it’s only a bidirectional index: you can get the next or 
previous value, but you can’t get the Nth value (except in linear time, by 
calling next N times).
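In Python terms, such an abstract-index protocol might look something like this sketch over a UTF-8 buffer (all class and method names are invented for illustration, not a proposal):

```python
class StrIndex:
    """Opaque position in a Utf8Str. It wraps a byte offset, but
    Python code is given no way to extract it as an int."""
    __slots__ = ("_byte_offset",)

    def __init__(self, byte_offset):
        self._byte_offset = byte_offset


class Utf8Str:
    def __init__(self, data: bytes):
        self._data = data  # assumed to be valid UTF-8

    def start(self):
        return StrIndex(0)

    def after(self, idx):
        """Bidirectional step: the index of the next code point."""
        i = idx._byte_offset + 1
        # skip UTF-8 continuation bytes (0b10xxxxxx)
        while i < len(self._data) and self._data[i] & 0xC0 == 0x80:
            i += 1
        return StrIndex(i)

    def slice_to(self, idx):
        """Rough analogue of Swift's s[..<idx]."""
        return self._data[: idx._byte_offset].decode("utf-8")
```

You can only obtain a StrIndex from the string itself (or by stepping one you already have), so every index is valid by construction, independent of the underlying representation.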

Of course usually you don’t want to search for commas, you want to parse CSV or 
JSON or something so you don’t even care about this. But when you’re writing 
that CSV or JSON or whatever module, you do.

> They can't be randomly accessed by code point index, but there's no
> reason you should ever want to randomly access a string by a code
> point index. It's a completely meaningless operation.

Yes, that’s my argument for why it’s likely acceptable that they can’t be 
randomly accessed, which I believe was the rest of the sentence that you cut 
off in the middle.

However, I think that goes a _tiny_ bit too far. You _usually_ don’t want to 
randomly access a string, but not _never_.

Let’s say you’re at the REPL, and you’re doing some exploratory hacking on a 
big hunk of text. Being able to use the result of that str.find from a few 
lines earlier in a slice is often handy. So it has to be something you can read 
and type easily. And it’s hard to get easier to type than an int. And notice 
that this is exactly how seek and tell work on text files. I don’t think the 
benefit of being able to avoid copy-pasting some ugly thing or repeating the 
find outweighs the benefit of not having to think in code units—but it is still 
a nonzero benefit. And the fact that Swift can’t do this in its quasi-REPL is 
sometimes a pain.
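Concretely, the REPL workflow in question is just this (the text and search target are invented):

```python
text = "lots of text ... Chapter 2: The Plan ... more text"
i = text.find("Chapter 2")   # a plain int: easy to read, retype, and reuse
text[i : i + 20]             # slice with it a few lines later
```

With opaque indices, `i` would be some unprintable object you'd have to keep alive or recompute, which is exactly the REPL friction described above.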

Also, Rust lets you randomly access strings by UTF-8 byte position and then ask 
for the next character boundary from there, which is handy for hand-optimizing 
searches without having to cast things back and forth to [u8]. But again, I 
don’t think that benefit outweighs the benefit of not having to think in either 
code units or code points.

Anyway, once you get rid of the ability to randomly access strings by code 
point, you no longer need to store strings as UTF-32 (or as Python’s 
clever UTF-8-or-16-or-32). When you read a UTF-8 text file (which is most text 
files you read nowadays), its buffer can already be the internal storage for a 
string. In fact, you can even mmap a UTF-8 text file and treat it as a Unicode 
string. (See ripgrep, which uses Rust’s ability to do this to make regex 
searching of large files both faster and simpler at the same time, if it’s not 
obvious why this is nice.)
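Python can approximate the ripgrep trick today, as long as you stay at the bytes level; here's a sketch (the function name is made up, and a real tool would also handle empty files and encodings other than UTF-8):

```python
import mmap
import re

def count_matches(path, pattern: bytes):
    """Regex-search a file without ever building a str: the mmap'd
    UTF-8 bytes are searched directly, with no decode or copy step."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return sum(1 for _ in re.finditer(pattern, mm))
```

A first-class UTF-8 string type could offer the same zero-copy behavior while still presenting a text interface, which is the payoff being argued for.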

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/BLSKUFPUZCQXFDLZR7IEI5OX3ZB6QS6K/
Code of Conduct: http://python.org/psf/codeofconduct/
