On Oct 25, 2019, at 20:44, Ben Rudiak-Gould <benrud...@gmail.com> wrote:
>
> Nothing good can come of decomposing strings into Unicode code points.
Which is why Unicode defines extended grapheme clusters, which are the intended approximation of human-language characters, and which can be made up of multiple code points. You don’t have to worry about combining characters and code units and normalization, and so on. And it is definitely possible to design a language that deals with clusters efficiently. Swift does it, for example. And they treat code units as no more fundamental than the underlying code points. (And, as you’d expect, code points and UTF-8/16/32 code units are appropriately-sized integers, not chars.)

In fact, Go treats the code units as _less_ fundamental: all strings are stored as UTF-8, so you can access the code units by just casting to a byte[], but if you want to iterate the code points (which are integers), you have to import functions for that, or encode to a UTF-32 not-a-string byte[], or use some special-case magic sugar. And Rust is essentially the same (but with more low-level stuff to write your own magic sugar, instead of hardcoded magic).

> The code point abstraction is practically as low level as the internal
> byte encoding of the strings. Only lexing libraries should look at
> strings at that level, and you should use a well written and tested
> lexing library, not a hacky hand-coded lexer.

OK, but how do you write that lexer? Most people should just get it off PyPI, but someone has to write it and put it on PyPI, and it has to have access to either grapheme clusters, or code units, or code points, or there’s no way it can lex anything. Unless you’re suggesting that the lexing library needs to be written in C (and then you’ll need different ones for Jython, etc.)?

> Explicit access to code points should be ugly – s.__codepoints__,
> maybe. And that should be a sequence of integers, not strings like
> "́".

Sure, but as you argued, code points are almost never what you want. And clusters don’t have a fixed size, or integer indices; how would you represent them except as a char type?
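To make the cluster-vs-code-point distinction concrete, here’s a small Python sketch (Python’s str is a sequence of code points, so a multi-code-point cluster is directly visible):

```python
import unicodedata

# One user-perceived character ("é") built from two code points:
# 'e' followed by a combining acute accent.
s = "e\u0301"

print(len(s))                    # 2 -- len counts code points, not clusters
print([hex(ord(c)) for c in s])  # ['0x65', '0x301']

# ...and three UTF-8 code units.
print(len(s.encode("utf-8")))    # 3

# NFC normalization happens to collapse this particular cluster to one
# code point, but that's a coincidence of the example: many clusters
# (e.g. emoji with modifiers) have no single-code-point form at all.
print(len(unicodedata.normalize("NFC", s)))  # 1
```

So a `__codepoints__` view of this string would give you [0x65, 0x301], and nothing in the code-point or code-unit layers tells you those two belong to one cluster; that’s exactly the segmentation a lexing (or grapheme) library has to compute.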
>> it’s probably worth at least considering making UTF-8 strings first-class
>> objects. They can’t be randomly accessed,
>
> They can be randomly accessed by abstract indices: objects that look
> similar to ints from C code, but that have no extractable integer
> value in Python code, so that they're independent of the underlying
> string representation.

You certainly can design a more complicated iteration/indexing protocol that handles this (C++, Swift, and Rust have exactly that, and even Python text files with seek and tell are roughly equivalent), but it’s explicitly not random access. You can only jump to a position you’ve already iterated to.

For example, in Swift, you can’t write `s[..<20]`, but if you write `s.firstIndex(of: ",")`, what you get back isn’t a number, it’s a String.Index, and you can use that in `s[..<commaIndex]`. And a String.Index is not a random-access index, it’s only a bidirectional index: you can get the next or previous value, but you can’t get the Nth value (except in linear time, by calling next N times).

Of course usually you don’t want to search for commas, you want to parse CSV or JSON or something, so you don’t even care about this. But when you’re writing that CSV or JSON or whatever module, you do.

> They can't be randomly accessed by code point index, but there's no
> reason you should ever want to randomly access a string by a code
> point index. It's a completely meaningless operation.

Yes, that’s my argument for why it’s likely acceptable that they can’t be randomly accessed, which I believe was the rest of the sentence that you cut off in the middle.

However, I think that goes a _tiny_ bit too far. You _usually_ don’t want to randomly access a string, but not _never_. Let’s say you’re at the REPL, and you’re doing some exploratory hacking on a big hunk of text. Being able to use the result of that str.find from a few lines earlier in a slice is often handy. So it has to be something you can read and type easily.
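The seek/tell analogy can be demonstrated with plain Python today. A sketch (the file name and contents are made up for illustration): on a text-mode file, tell() returns an opaque integer cookie, and the docs only bless cookies obtained from tell() as arguments to seek(), which is exactly the "you can only jump to a position you’ve already iterated to" protocol.

```python
import os
import tempfile

# Write some non-ASCII text, then reopen it in text mode.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as f:
    f.write("héllo, wörld")
    path = f.name

with open(path, encoding="utf-8") as f:
    f.read(6)         # iterate forward six characters: "héllo,"
    pos = f.tell()    # an opaque cookie, NOT a character index
    rest = f.read()   # " wörld"
    f.seek(pos)       # seeking to a cookie you got from tell() is fine...
    assert f.read() == rest
    # ...but arithmetic on it (pos + 1, pos * 2) has no defined meaning:
    # the cookie encodes decoder state, not a count of anything.

os.unlink(path)
```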
And it’s hard to get easier to type than an int. And notice that this is exactly how seek and tell work on text files. I don’t think the benefit of being able to avoid copy-pasting some ugly thing or repeating the find outweighs the benefit of not having to think in code units -- but it is still a nonzero benefit. And the fact that Swift can’t do this in its quasi-REPL is sometimes a pain.

Also, Rust lets you randomly access strings by UTF-8 byte position and then ask for the next character boundary from there, which is handy for hand-optimizing searches without having to cast things back and forth to [u8]. But again, I don’t think that benefit outweighs the benefit of not having to think in either code units or code points.

Anyway, once you get rid of the ability to randomly access strings by code point, this means you don’t need to store strings as UTF-32 (or as Python’s clever 1-, 2-, or 4-bytes-per-code-point flexible representation). When you read a UTF-8 text file (which is most text files you read nowadays), its buffer can already be the internal storage for a string. In fact, you can even mmap a UTF-8 text file and treat it as a Unicode string. (See ripgrep, which uses Rust’s ability to do this to make regex searching large files both faster and simpler at the same time, if it’s not obvious why this is nice.)

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/BLSKUFPUZCQXFDLZR7IEI5OX3ZB6QS6K/
Code of Conduct: http://python.org/psf/codeofconduct/