In the middle of a billion other things, I've been digging into Unicode a bit. In the process, I happened across this blog entry by Armin Ronacher:
"Everything You Did Not Want to Know about Unicode in Python 3" <http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/>

It's a nice gathering of a bunch of annoying issues. I've sort of been trundling along on the strength of Mark Miller's recommendation to just use the NFC normalization, without really digging into it. Yesterday, I had a chat session with Tim Čas, who pointed me at something I hadn't known. I had assumed that all of the combining marks in Unicode had corresponding combined forms. That turns out not to be true, and Tim pointed me at examples. This is a very nasty thing to learn, because it means one cannot assume that UCS4 code points are self-contained. It also means that there is absolutely no hope of correctly processing *text* by viewing it as a sequence of singleton code points. And just to hammer the point home, neither NFC nor NFD is closed under concatenation; there are cases where you need to rearrange marks to re-establish normalization.

Note this has some nasty implications. If s = s1 + s2 (by normalizing concatenation), then in NFC:

- |s| <= |s1| + |s2|   (as opposed to the expected simple sum)
- s1[|s1|-1] and/or s2[0] will not necessarily appear anywhere within s

and in NFD:

- |s| == |s1| + |s2|   (all code points are retained)
- some code points around the point of concatenation may get reordered to preserve normalization
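To make that concrete, here is a minimal sketch in Python 3 (standard library only; the particular characters are just convenient examples I picked, not ones from the discussion above):

    import unicodedata

    # NFC: "e" followed by U+0301 COMBINING ACUTE ACCENT composes into the
    # single code point U+00E9, so the result is shorter than the simple sum
    # and neither code point adjacent to the join survives as-is.
    s1, s2 = "e", "\u0301"
    s = unicodedata.normalize("NFC", s1 + s2)
    print(len(s), len(s1) + len(s2))  # 1 2
    print(s2[0] in s)                 # False

    # Not every base+mark pair has a precomposed form: "q" followed by
    # U+0303 COMBINING TILDE stays as two code points even under NFC.
    print(len(unicodedata.normalize("NFC", "q\u0303")))  # 2

    # NFD: all code points are retained, but marks near the join can be
    # reordered by canonical combining class -- U+0323 COMBINING DOT BELOW
    # (class 220) must precede U+0301 COMBINING ACUTE ACCENT (class 230).
    t1, t2 = "a\u0301", "\u0323"
    t = unicodedata.normalize("NFD", t1 + t2)
    print(len(t) == len(t1) + len(t2))  # True
    print([hex(ord(c)) for c in t])     # ['0x61', '0x323', '0x301']

The same behavior shows up in any language, not just Python; it falls out of the normalization rules themselves.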
OK. So back to my conversation with Tim, who pointed out that the absence of fully combined forms means that the entire concept of "a codepoint is kind of like a character" is broken. If that's true, *there may be no point to viewing a string as an indexable sequence of code points*. In fact, that may actually encourage programmers to do the wrong thing. Tim is working on an interpreter where he's contemplating defining strings as indexed over bytes, which seems to be what Go does. Basically, he's saying that code points are already a bad abstraction, and layering another bad abstraction on top of them doesn't improve matters.

Meanwhile, Armin's blog post (above) notes that there are a whole lot of circumstances where you are handed a bit of text without knowing its encoding. If you're just passing it through, the right way to think about it is as blind byte data. If you actually need to treat it as text in some way, then you need to know, or be told, or heuristically guess what the encoding is, so that you can transform that text into an encoding that is common with everything else. And actually, that makes for an important side point: whatever encoding you choose, you really want all of your strings to have a common encoding.

All of which brings me to the question: what, conceptually, is a string? I'm coming to the view that a string is more than a type wrapper on a codepoint vector. Putting something into a string means you think it's text, which means you should know what its encoding and normalization properties are. And if it turns out that indexing by codepoint (which is initially O(n)) isn't really convenient, then the question has to be asked: why bother? Hiding the indexing cursor doesn't seem very helpful. If the encoding used in strings is UTF-8 or UTF-16, you'll have to deal with variable-length encodings in any case; it's hard to see a really good reason to accept O(n) indexing *in addition* to that overhead.
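As a small illustration of the gap between bytes, code points, and user-perceived characters, here is a Python 3 sketch (the sample string is just something I made up):

    # "q" + combining tilde: one user-perceived character, two code points,
    # three UTF-8 bytes.
    s = "q\u0303"
    b = s.encode("utf-8")

    print(len(s))   # 2    -- code points
    print(len(b))   # 3    -- UTF-8 bytes
    print(s[0])     # 'q'  -- codepoint indexing splits the visible character anyway
    print(b[0:1])   # b'q' -- byte indexing, as in Go, at least doesn't pretend otherwise

Neither indexing scheme hands you "the first character" in the sense a user means; the difference is that byte indexing doesn't claim to.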
So let me pause here and solicit input.

1. Do we all buy the story that Strings are conceptually used for text?
2. Does the following set of rules for strings make sense? If not, why not?

- Strings are normalized via NFC
- String operations preserve NFC encoding
- Strings are encoded in UTF-8
- Strings are indexed *by the byte*

A rough sketch of what such a string might look like follows the list.
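For concreteness, here is a minimal sketch in Python 3 of a string type following those four rules; it is purely illustrative, and the NfcString name and its methods are my own invention, not anything from BitC or from this thread:

    import unicodedata

    class NfcString:
        """Text held as NFC-normalized UTF-8 bytes, indexed by the byte."""

        def __init__(self, text):
            # Normalize at the boundary, so every NfcString is NFC by construction.
            self._bytes = unicodedata.normalize("NFC", text).encode("utf-8")

        def __len__(self):
            # Length in bytes, not in code points or characters.
            return len(self._bytes)

        def __getitem__(self, i):
            # Byte indexing: O(1), but it may land in the middle of a code point.
            return self._bytes[i]

        def __add__(self, other):
            # Concatenation must re-normalize, because NFC is not closed
            # under concatenation (see the examples earlier in this message).
            return NfcString((self._bytes + other._bytes).decode("utf-8"))

        def __str__(self):
            return self._bytes.decode("utf-8")

    a = NfcString("e")
    b = NfcString("\u0301")   # combining acute accent
    c = a + b
    print(str(c), len(c))     # é 2  -- one code point U+00E9, two UTF-8 bytes

Whether byte indexing like this is a good programmer-facing interface is exactly the question above; the sketch only shows that the four rules fit together.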
Look forward to hearing reactions and thoughts here.

Jonathan