On Fri, Oct 25, 2019 at 08:44:17PM -0700, Ben Rudiak-Gould wrote:

> Nothing good can come of decomposing strings into Unicode code points.

Sure there is. In Python, it's the fastest way to calculate the digit sum
of an integer. It's also useful for implementing classical encryption
algorithms, like Playfair.

Then there's introspection, e.g. if I want to know whether a string
contains any surrogates, I can do this:

    any('\uD800' <= c <= '\uDFFF' for c in s)

Or perhaps I want to know if the string contains any "astral
characters", in which case it isn't safe to pass to a JavaScript or Tcl
script which doesn't handle them correctly:

    any(c > '\uFFFF' for c in s)

How about education? One of the things I can do with strings is:

    import unicodedata
    for c in string:
        print(unicodedata.name(c))

or possibly even just

    # what is that weird symbol in position five?
    print(unicodedata.name(string[5]))

to find out what that weird character is called, so I can look it up and
find out what it means. Knowing stuff is good, right? Or do you think
the world would be better off if it was really hard and "ugly" (your
word) for people like me to find out what code points are called and
what their meaning is?

Rather than just telling us that we shouldn't be allowed to access code
points in strings, would you please be explicit about *why* this access
is a bad thing?

And if code points are "bad", then what should we be allowed to do with
strings? If code points are too low-level, then what is an appropriate
level?

I guess you're probably going to mention grapheme clusters. (If you
aren't, then I have no idea what your objection is based on.)

Grapheme clusters are a hard problem to solve, since they are dependent
on the language and the locale. There's a Unicode algorithm for
splitting on graphemes, but it ignores the locale differences.

Processing on graphemes is more expensive than on code points. There is,
as far as I can tell, no O(1) access to graphemes in a string without
pre-processing them and keeping a list of their indices.

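For concreteness, here is a minimal sketch of the code point operations
mentioned above. The function names (digit_sum, has_surrogates,
has_astral, code_point_names) are made up for illustration and are not
part of any library. The '<unnamed>' placeholder is likewise an
assumption, used because unicodedata.name() raises ValueError for code
points that have no name.

```python
import unicodedata


def digit_sum(n):
    """Sum the decimal digits of an integer by iterating over str(n)."""
    return sum(int(c) for c in str(abs(n)))


def has_surrogates(s):
    """True if any code point in s lies in the surrogate range."""
    return any('\uD800' <= c <= '\uDFFF' for c in s)


def has_astral(s):
    """True if any code point in s lies above the Basic Multilingual Plane."""
    return any(c > '\uFFFF' for c in s)


def code_point_names(s):
    """Return the Unicode name of each code point in s.

    Code points without a name (e.g. control characters) get the
    hypothetical placeholder '<unnamed>' instead of raising ValueError.
    """
    return [unicodedata.name(c, '<unnamed>') for c in s]
```

As an illustration of the code point versus grapheme distinction,
'e\u0301' (an "e" followed by a combining acute accent) displays as a
single character but has a len() of 2, and code_point_names() reports
both of its code points separately.
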
For many people, and for many purposes, paying the extra time or memory
cost of grapheme processing is just a total waste, since they're hardly
ever going to come across a grapheme cluster. Few people have to process
completely arbitrary strings: their data tends to come from a particular
subset of natural language strings, and for some such languages, you
might go a whole lifetime without coming across a grapheme cluster of
more than one code point.

(This may be slowly changing, even for American English, driven in part
by the use of emoji and variation selectors.)

If Python came with a grapheme processing API, I would probably use it.
But in the meantime, the code point API is "good enough" for most things
I do with strings. And for the rest, graphemes are too low-level: I need
things like sentences, clauses, words, word stems, prefixes and
suffixes, syllables etc.

But even if Python had an excellent, fast grapheme API, I would still
want a nice, clean, fast interface that operates on code points.

--
Steven
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/OCG64OW4WPVDFUSN3R7AGI6M4NFKGJIP/
Code of Conduct: http://python.org/psf/codeofconduct/