In the middle of a billion other things, I've been digging into Unicode a
bit. In the process, I happened across this blog entry by Armin Ronacher:

Everything You Did Not Want to Know about Unicode in Python 3
<http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/>


It's a nice gathering of a bunch of annoying issues. I've sort of been
trundling along on the strength of Mark Miller's recommendation to just use
the NFC normalization, without really digging into it. Yesterday, I had a
chat session with Tim Čas, who pointed me at something I hadn't known. I
had assumed that all of the combining marks in Unicode had corresponding
combined (precomposed) forms. That turns out not to be true, and Tim
pointed me at examples.
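
To make that concrete, here is a quick check in Python 3 (the particular
characters are my choice, just for illustration): 'e' plus COMBINING ACUTE
ACCENT has a precomposed form, but 'x' plus the same mark does not, so NFC
leaves the latter as two code points:

    import unicodedata

    # 'e' + U+0301 composes to the precomposed U+00E9 under NFC.
    # 'x' + U+0301 has no precomposed form, so NFC keeps both code points.
    print(len(unicodedata.normalize("NFC", "e\u0301")))   # 1
    print(len(unicodedata.normalize("NFC", "x\u0301")))   # 2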

This is a very nasty thing to learn, because it means one cannot assume
that UCS-4 code points are self-contained. It also means that there is
absolutely no hope of correctly processing *text* by viewing it as a
sequence of singleton code points.

And just to hammer the point home, neither NFC nor NFD is closed under
concatenation; there are cases where you need to rearrange marks to
re-establish normalization. Note that this has some nasty implications (a
short Python illustration follows the two lists). If s = s1 + s2 (by
normalizing concatenation), then in NFC:

   - |s| <= |s1| + |s2|  // as opposed to the expected simple sum
   - s1[|s1|-1] and/or s2[0] will not necessarily appear anywhere within s


and in NFD:

   - |s| == |s1| + |s2|
   - All code points are retained
   - Some code points around the point of concatenation may get reordered
     to preserve normalization

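Here is the promised illustration, a minimal sketch using Python's
unicodedata module (the particular characters are my choice):

    import unicodedata

    nfc = lambda s: unicodedata.normalize("NFC", s)
    nfd = lambda s: unicodedata.normalize("NFD", s)

    # NFC: composition across the seam can shrink the result.
    s1, s2 = nfc("e"), nfc("\u0301")     # each piece is NFC on its own
    s = nfc(s1 + s2)                     # composes to "\u00e9"
    assert len(s) < len(s1) + len(s2)    # 1 < 2
    assert s1[-1] not in s and s2[0] not in s   # boundary code points vanish

    # NFD: every code point survives, but marks at the seam get reordered
    # by canonical combining class.
    t1 = nfd("a\u0301")    # COMBINING ACUTE ACCENT, combining class 230
    t2 = nfd("\u0323")     # COMBINING DOT BELOW,    combining class 220
    t = nfd(t1 + t2)       # "a\u0323\u0301": the dot below moved forward
    assert len(t) == len(t1) + len(t2)
    assert sorted(t) == sorted(t1 + t2)  # same code points...
    assert t != t1 + t2                  # ...in a different order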

OK. So back to my conversation with Tim, who pointed out that the absence
of fully combined forms means that the entire concept of "a codepoint is
kind of like a character" is broken. If that's true, *there may be no point
to viewing a string as an indexable sequence of code points*. In fact, that
may actually encourage programmers to do the wrong thing. Tim is
working on an interpreter where he's contemplating defining strings as
indexed over bytes, which seems to be what Go is doing. Basically he's
saying that code points are already a bad abstraction, and layering another
bad abstraction on top of things doesn't improve matters.
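
Python makes the problem easy to see: indexing by code point can hand back
half of what a user would call a character whenever no precomposed form
exists, and no normalization form can repair that:

    # Indexing by code point still splits user-perceived characters.
    s = "x\u0301"    # one "character" to the user: x with an acute accent
    print(len(s))    # 2 -- two code points, and NFC cannot shorten it
    print(s[0])      # 'x' -- half a character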

Meanwhile, Armin's blog post (above) notes that there are a whole lot of
circumstances where you are handed a bit of text without knowing its
encoding. If you're just passing it through, the right way to think about
things is that it's blind byte data. If you actually need to treat it as
text in some way, then you need to know, or to be told, or to guess
heuristically what the encoding is so that you can transform that text into
an encoding that is common with everything else.

And actually, that makes for an important side point: whatever encoding you
choose, you really want all of your strings to have a common encoding.
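
In code, the distinction might look something like this (a sketch; the
function names, and the idea that the encoding arrives out of band, are my
assumptions):

    import unicodedata

    def pass_through(data: bytes) -> bytes:
        # If we are only relaying the data, treat it as opaque bytes.
        return data

    def to_common_form(data: bytes, declared_encoding: str) -> str:
        # To treat it as text we must know (or be told, or guess) the
        # encoding, then convert into the one common internal form.
        return unicodedata.normalize("NFC", data.decode(declared_encoding))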

All of which brings me to the question: what, conceptually, is a string?
I'm coming to the view that a string is more than a type wrapper on a
codepoint vector. Putting something into a string means you think it's
text, which means you should know what its encoding and normalization
properties are.

And if it turns out that indexing by codepoint (which is initially O(n))
isn't really convenient, then the question has to be asked: why bother?
Hiding the indexing cursor doesn't seem very helpful. If the encoding used
in strings is UTF-8 or UTF-16, you'll have to deal with variable-length
encodings in any case; it's hard to see a really good reason to accept O(n)
indexing *in addition* to that overhead.
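
For concreteness, here is what per-code-point indexing costs over UTF-8
bytes (the helper is mine, just to show the scan):

    def codepoint_offset(data: bytes, n: int) -> int:
        # Byte offset of the n-th code point: an O(n) scan, since we must
        # skip continuation bytes (those of the form 0b10xxxxxx).
        seen = 0
        for i, b in enumerate(data):
            if b & 0xC0 != 0x80:        # a new code point starts here
                if seen == n:
                    return i
                seen += 1
        raise IndexError(n)

    data = "naïve".encode("utf-8")      # 5 code points, 6 bytes
    print(codepoint_offset(data, 3))    # 4 -- found only by walking
    print(data[4:5])                    # b'v' -- byte indexing is direct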

So let me pause here and solicit input.

1. Do we all buy the story that Strings are conceptually used for text?
2. Does the following set of rules for strings make sense? If not, why
   not? (A minimal sketch of such a string type follows the list.)

   - Strings are normalized via NFC
   - String operations preserve NFC normalization
   - Strings are encoded in UTF-8
   - Strings are indexed *by the byte*

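Here is the sketch mentioned in question 2, in Python (the class and its
details are mine, only to make the rules concrete):

    import unicodedata

    class Str:
        # A string per the rules above: NFC-normalized text, stored and
        # indexed as UTF-8 bytes.
        def __init__(self, text: str):
            self._bytes = unicodedata.normalize("NFC", text).encode("utf-8")

        def __len__(self):
            return len(self._bytes)      # length in bytes

        def __getitem__(self, i):
            return self._bytes[i]        # byte indexing and slicing

        def __add__(self, other: "Str") -> "Str":
            # Re-normalize across the seam so concatenation preserves NFC.
            return Str((self._bytes + other._bytes).decode("utf-8"))

    s = Str("e") + Str("\u0301")
    print(len(s))    # 2 -- composed to U+00E9, two UTF-8 bytes (was 1 + 2)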

I look forward to hearing reactions and thoughts here.


Jonathan
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
