On Sat, 8 Sep 2018 18:36:00 +0200 Mark Davis ☕️ via Unicode <[email protected]> wrote:
> I recently did some extensive revisions of a paper on Unicode string
> models (APIs). Comments are welcome.
>
> https://docs.google.com/document/d/1wuzzMOvKOJw93SWZAqoim1VUl9mloUxE0W6Ki_G23tw/edit#

Theoretically at least, the cost of indexing a big string by codepoint is negligible. For example, the cost of accessing the middle character is O(1)*, not O(n), where n is the length of the string. The trick is to use a proportionately small amount of memory to store and maintain a partial conversion table from character index to byte index. For example, Emacs claims to offer O(1) access to a UTF-8 buffer by character number, and I can't significantly fault the claim.

*There may be some creep, but it doesn't matter for strings that can be stored within a galaxy.

Of course, the coefficients implied by big-oh notation also matter. For example, it can be very easy to forget that a bubble sort is often the quickest sorting algorithm.

You keep muttering that a sequence of 8-bit code units can contain invalid sequences, but often forget that that is also true of sequences of 16-bit code units. Do emoji now ensure that confusion between codepoints and code units rapidly comes to light?

You seem to keep forgetting that grapheme clusters are not how some people work. Does the English word 'café' contain the letter 'e'? Yes or no? I maintain that it does. I can't help thinking that one might want to look for the letter 'ă' in Vietnamese and find it whatever the associated tone mark is.

You didn't discuss substrings. I'm interested in how subsequences of strings are defined, as the concept of 'substring' isn't really Unicode compliant. Again, expressing 'ă' as a subsequence of the Vietnamese word 'nặng' ought to be possible, whether one is using NFD (easier) or NFC. (And there are alternative normalisations that are compatible with canonical equivalence.) I'm most interested in subsequences X of a word W where W is the same as AXB for some strings A and B.

Richard.
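To make the partial conversion table from character index to byte index concrete, here is a minimal sketch in Python. It is an illustration only and is not a description of how Emacs does it. The class name `IndexedUtf8`, its methods, and the `stride` parameter are all hypothetical. The table records the byte offset of every `stride`-th codepoint, so a lookup scans forward over fewer than `stride` codepoints whatever the length of the string, and the table holds about n/stride entries. The sketch covers an immutable string; maintaining the table under edits is not shown.

```python
class IndexedUtf8:
    """UTF-8 bytes plus a sparse table: one byte offset per `stride` codepoints.

    Hypothetical illustration; names and layout are assumptions.
    """

    def __init__(self, text: str, stride: int = 64):
        self.data = text.encode("utf-8")
        self.stride = stride
        # checkpoints[k] is the byte offset of codepoint number k * stride.
        self.checkpoints = []
        count = 0
        for offset, byte in enumerate(self.data):
            if byte & 0xC0 != 0x80:  # not a continuation byte: a codepoint starts here
                if count % stride == 0:
                    self.checkpoints.append(offset)
                count += 1
        self.length = count  # number of codepoints

    def byte_offset(self, index: int) -> int:
        """Byte offset of codepoint `index`, scanning fewer than `stride` codepoints."""
        if not 0 <= index < self.length:
            raise IndexError(index)
        offset = self.checkpoints[index // self.stride]
        for _ in range(index % self.stride):
            offset += 1
            while self.data[offset] & 0xC0 == 0x80:  # skip continuation bytes
                offset += 1
        return offset

    def char_at(self, index: int) -> str:
        """The codepoint at position `index`, decoded from its UTF-8 bytes."""
        start = self.byte_offset(index)
        end = start + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[start:end].decode("utf-8")
```

With a fixed `stride` the lookup cost does not grow with n, which is the sense in which access is O(1); the choice of `stride` trades table size against the constant factor.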

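The W = AXB reading of "subsequence" can be sketched in Python with the standard `unicodedata` module. The function name `contains_canonically` is hypothetical, and the approach is an assumption about one way to test the idea, not an established algorithm: it works in NFD, picks out the codepoints of X greedily from W, moves the skipped codepoints into B, and accepts only if A + X + B is canonically equivalent to W. The final check means it should not report a false match, but the greedy selection may miss some valid decompositions.

```python
import unicodedata


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


def contains_canonically(word: str, piece: str) -> bool:
    """True if strings A and B are found with A + piece + B canonically
    equivalent to word. Greedy sketch: may miss some valid splits."""
    d = nfd(word)
    x = nfd(piece)
    if not x:
        return True
    for i, ch in enumerate(d):
        if ch != x[0]:
            continue
        matched = [i]
        j = i + 1
        for want in x[1:]:
            # Greedily find the next codepoint of the piece, skipping others.
            while j < len(d) and d[j] != want:
                j += 1
            if j == len(d):
                break  # this starting position cannot supply the whole piece
            matched.append(j)
            j += 1
        else:
            used = set(matched)
            a = d[:i]
            # Skipped codepoints are moved after the piece, in original order.
            b = "".join(c for k, c in enumerate(d) if k > i and k not in used)
            if nfd(a + x + b) == d:  # verify canonical equivalence
                return True
    return False
```

For example, 'nặng' in NFD is n, a, U+0323 (dot below), U+0306 (breve), n, g, so 'ă' (a, U+0306) is not a contiguous substring; the sketch takes A = 'n' and B = U+0323 followed by 'ng', and canonical reordering of the two marks makes AXB equivalent to W. The same function finds 'e' in 'café' whether the input is NFC or NFD.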
