Re: Strings Manifesto

Larry Wall Wed, 28 Apr 2004 10:59:38 -0700

All in all, very well written.  I do, of course, have a few quibbles:

On Wed, Apr 28, 2004 at 04:22:07AM -0700, Jeff Clites wrote:
: As it turns out, people find it convenient to programmatically represent a 
: character by an integer (think "whole number", not a specific data type 
: here).


After being so careful to define "character" abstractly, this whole
passage misleads the reader into believing that any such abstract
character can be represented by a single integer (code point).
Only a subset of characters can be represented by a single code point.
Many characters require multiple code points.  I see this as a critical
point--it's at the one-to-many interfaces that things tend to break,
and that's precisely why Perl 6 has the four abstraction levels it does:

    Level 0: bytes
    Level 1: codepoints don't fit into bytes
    Level 2: graphemes don't fit into codepoints
    Level 3: characters don't fit into graphemes

(where I've used the term "characters" in the language-sensitive sense.)
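
As a rough illustration of the first three levels, here is a small Python
sketch (Python rather than Perl 6, purely for convenience).  It assumes UTF-8
for the byte level, and the grapheme count is a deliberately crude
approximation that just skips combining marks; it is not the full Unicode
segmentation algorithm.  The function name level_counts is made up for
this example.

```python
import unicodedata

def level_counts(s):
    """Return (bytes, codepoints, approx_graphemes) for the string s."""
    n_bytes = len(s.encode("utf-8"))   # Level 0, assuming UTF-8 storage
    n_codepoints = len(s)              # Level 1: a Python str is indexed by code point
    # Level 2, crudely: count only code points with combining class 0.
    # This is an assumption for illustration, not real grapheme segmentation.
    n_graphemes = sum(1 for ch in s if unicodedata.combining(ch) == 0)
    return n_bytes, n_codepoints, n_graphemes

decomposed = "e\u0301"    # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE

print(level_counts(decomposed))   # (3, 2, 1)
print(level_counts(precomposed))  # (2, 1, 1)
```

The same visible "e with acute" gives a different answer at every level
below graphemes, which is where the one-to-many breakage shows up.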

Not making this distinction also causes you to leave out a level
of collation:

    Level 0: binary sorting
    Level 1: codepoint sorting
    Level 2: language-independent grapheme sorting (UCA)
    Level 3: UCA plus tailorings
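
A minimal Python sketch of how the lower collation levels can disagree.
It assumes UTF-16BE as the byte encoding for the "binary" case (in UTF-8,
byte order and codepoint order happen to coincide), and it does not attempt
UCA at all, since the Python standard library has no UCA implementation.

```python
# Level 0 vs. Level 1: byte order of UTF-16 is not codepoint order,
# because supplementary characters are stored as surrogates (D800..DFFF).
words = ["\uff21", "\U00010000"]   # U+FF21 and U+10000

by_codepoint = sorted(words)       # Python compares str by code point
by_utf16_bytes = sorted(words, key=lambda s: s.encode("utf-16-be"))

print([hex(ord(w)) for w in by_codepoint])    # ['0xff21', '0x10000']
print([hex(ord(w)) for w in by_utf16_bytes])  # ['0x10000', '0xff21']

# Level 1 vs. Level 2: codepoint order separates two spellings of the
# same grapheme, putting "f" between decomposed and precomposed e-acute.
spellings = sorted(["\u00e9", "f", "e\u0301"])
print(spellings == ["e\u0301", "f", "\u00e9"])  # True
```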

: It's convenient for several reasons--it's compact and easy to 
: refer to in speech. And if the fundamental thing you can ask a string 
: is what its Nth character is, then the fundamental things you do with a 
: character are to look up its properties, and test it for equality against 
: other characters. So if you just go through and give each character a 
: little serial number, then you can find the properties of a character 
: by using its number as an index into a property table (i.e., character 
: 3's properties are at slot number 3 in the table), and you can tell 
: that 2 characters are different characters by checking whether they are 
: represented by different numbers.

But this is really only true of codepoints, not of graphemes or
characters.  I realize that oversimplifying is a useful pedagogical
technique, but when you do that you ought to "unlie" in the same
document somewhere.  (I'll grant you that you promise to unlie in
your final paragraph, kinda sorta.)

: Fortunately, the Unicode Standard has numbered *all* of them--it's 
: given a number to essentially every character in every 
: digitally-represented language in the world.

Um, no--not unless you've defined how to multiplex the multiple
integers of the codepoints in a grapheme into a single integer, and
I haven't heard that the Unicode consortium has come up with such
a definition.
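
One way to see this, sketched in Python with the standard unicodedata
module: NFC normalization composes a sequence into a single codepoint when
Unicode defines one, and otherwise leaves the sequence alone.  The helper
name nfc_codepoints is invented for this example.

```python
import unicodedata

def nfc_codepoints(s):
    """Code points (as integers) of s after NFC normalization."""
    return [ord(ch) for ch in unicodedata.normalize("NFC", s)]

# e + combining acute has a precomposed form, U+00E9.
print(nfc_codepoints("e\u0301"))  # [233]

# q + combining acute has no precomposed form, so it stays a sequence.
print(nfc_codepoints("q\u0301"))  # [113, 769]
```

The second grapheme has no single number of its own; it can only be named
by a sequence of integers.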

: So, let's review again. For various practical reasons, it's preferable 
: to programmatically represent characters using integers, you have to 
: pick an arbitrary numbering scheme, and somebody's done that, and it's 
: a good one. This numbering scheme defines a one-to-one correspondence 
: between numbers (code points) and characters,

There you go again.  You need to settle on one definition of character
or the other.  I kind of like the abstract definition, but that's not
how you're using it here.

: and that makes it 
: tempting to pretend that characters *are* numbers. But it's important 
: to keep in the back of your mind an awareness that the numbers merely 
: help you pick out the characters, and it's the characters themselves 
: which are important, and characters are *abstract*--they never actually 
: live inside of a computer program.

Cain't have it both ways...

: [Note: Of course, some numbers don't 
: represent any character--there are only so many characters. So to be 
: mathematically precise, there's a one-to-one correspondence between a 
: subset of integers and all characters.]

And many characters are not represented by any integer, but by a sequence
of integers.

: Also, importantly, a grapheme cluster is a notion built on top of 
: characters (it's a cluster of characters), and choosing a language lets 
: you refine how you break up a string into grapheme clusters, but it's 
: just a refinement--"adding a language into the mix" doesn't pick out a 
: different semantic construct, it just helps you customize your choice of 
: what ranges make up single graphemes.

I'd say a grapheme cluster functions as a "character" by your original
definition, so this is another case where you're using "character"
to mean something less than that.  Also the last sentence seems to
be calling a grapheme cluster a grapheme, which is confusing.  A grapheme
cluster is a cluster of graphemes, kinda by definition...

: I haven't yet covered a few important topics, such as different 
: character sequences representing equivalent graphemes, canonical and 

s/character/codepoint/

: compatibility equivalence, and Unicode normalization forms. I also 
: haven't said anything yet about concrete implementation or API 
: guidelines.

I await your coverage of those topics with interest.

Larry
