Several people have pointed out that Unicode creates difficulties with
the idea of strings being simple vector-like sequences of characters,
because it takes a sequence of Unicode codepoints to represent
some graphemes and breaking those sequences is a Bad Thing.

There is an alternative: instead of character==codepoint, use
character==grapheme. That is, every character is a base (i.e.,
non-modifier) character plus zero or more modifiers, and modifiers
never appear directly in strings.

As Alaric Snell-Pym noted in his message of Wed, 23 Sep 2009 16:34:04 +0100,
http://www.unicode.org/reports/tr29/ "suggests that grapheme
clusters be the default notion of string length for users (and
graphemes the default notion of what a string is composed of)".

Perl 6 will support this view of strings (along with lots of
other fancy stuff which I'll ignore here -- the details can be
found at http://perlcabal.org/syn/S02.html). The Perl 6 spec
suggests the following implementation:
 - A character is represented as an integer.
 - Characters defined in Unicode are represented by their
   codepoint value. This includes precomposed characters:
   e-acute is represented by 0x00E9.
 - Other characters are represented as indexes into an internal
   table maintained by the run-time.
 - Whenever the run-time sees a base+modifiers sequence, it
   checks whether that grapheme is a precomposed character;
   if not, the run-time adds the grapheme to its table.
If the standard allows implementations to refuse to deal with
more than (say) 2^32 different graphemes in one execution,
this makes strings easy to implement.
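
As I read it, the table part of that scheme could look roughly like
the R6RS sketch below. This is my own reading, not anything from the
Perl 6 code: GRAPHEME->PRECOMPOSED is a two-entry stand-in for a real
Unicode canonical-composition table, and the names and the starting
id (#x110000, the first integer above the codepoint range) are my own
choices.

  (import (rnrs))

  ;; Two-entry stand-in for a real Unicode canonical-composition table.
  (define (grapheme->precomposed base modifiers)
    (cond ((equal? (cons base modifiers) '(#x0065 #x0301)) #x00E9)   ; e + combining acute
          ((equal? (cons base modifiers) '(#x006E #x0303)) #x00F1)   ; n + combining tilde
          (else #f)))

  (define *next-id* #x110000)                   ; first integer above the codepoint range
  (define *graphemes* (make-equal-hashtable))   ; (base . modifiers) -> integer

  ;; Map a base codepoint plus a list of modifier codepoints to an integer:
  ;; use the precomposed codepoint if there is one, otherwise intern the
  ;; grapheme in the table the first time it is seen.
  (define (intern-grapheme base modifiers)
    (let ((key (cons base modifiers)))
      (or (grapheme->precomposed base modifiers)
          (hashtable-ref *graphemes* key #f)
          (let ((id *next-id*))
            (set! *next-id* (+ *next-id* 1))
            (hashtable-set! *graphemes* key id)
            id))))

With this, (intern-grapheme #x0065 '(#x0301)) yields #x00E9, while an
unlisted base+modifiers combination gets a fresh id at or above
#x110000 and the same id every time thereafter.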

The character==codepoint approach requires programmers to be
careful when manipulating strings, and pretty much requires
library support for finding grapheme boundaries, counting
graphemes, and so on. On the other hand, characters are simple.
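
For example, counting graphemes under character==codepoint amounts to
something like the sketch below, where COMBINING-MARK? is a
deliberately crude stand-in of mine (it covers only the basic
combining-diacriticals block; a real library would implement the full
rules of UAX #29, including Hangul jamo, ZWJ sequences and so on):

  (import (rnrs))

  ;; Deliberately crude: treats only U+0300..U+036F as combining marks.
  (define (combining-mark? c)
    (<= #x0300 (char->integer c) #x036F))

  ;; Count graphemes by counting the characters that start one, i.e.
  ;; every character that is not a combining mark.
  (define (string-grapheme-length s)
    (let loop ((i 0) (n 0))
      (cond ((= i (string-length s)) n)
            ((combining-mark? (string-ref s i)) (loop (+ i 1) n))
            (else (loop (+ i 1) (+ n 1))))))

So (string-grapheme-length "e\x0301;") returns 1 where string-length
reports 2.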

The character==grapheme approach makes strings simple and
characters complicated. You'd even need a procedure to
construct character values, something like
  (make-character BASE_CHAR LIST_OF_MODIFIERS)
which I for one find mind-boggling.
Then you need procedures to add, remove and reorder modifiers,
and so on; a rough sketch of such an API follows.
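
Just to make the point concrete, here is roughly what I imagine such
an API looking like in R6RS terms; all the names below are my own
invention, not from any standard or from the Perl 6 spec:

  (import (rnrs))

  ;; A character value is a base codepoint plus a list of modifier codepoints.
  (define-record-type character*
    (fields base modifiers))

  (define (make-character base modifiers)
    (make-character* base modifiers))

  ;; Append a modifier codepoint to a character.
  (define (character-add-modifier c m)
    (make-character* (character*-base c)
                     (append (character*-modifiers c) (list m))))

  ;; Drop every occurrence of a modifier codepoint from a character.
  (define (character-remove-modifier c m)
    (make-character* (character*-base c)
                     (remove m (character*-modifiers c))))

Add reordering into canonical order and so forth, and you can see
where the complexity that used to live in strings ends up.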

So we have an alternative to the character==codepoint approach.
Is it a good alternative? I don't know.

A good answer to that question will probably have to wait for
Perl 6 to be released and put into widespread use.
If this approach were adopted, support would have to be optional
for the sake of resource-constrained implementations.


While I'm here, let me get a few personal opinions off my chest:

Unicode is a pain in the proverbial posterior, but the only viable
alternative, ISO 2022, is a brain-devouring Cthulhu-level monstrosity.

The days of languages with one character per byte are gone. Even
16-bit characters are no longer viable. (Hello, Java programmers.
Sucks to be you.) Unless you are willing to restrict yourself to
European-and-related cultures or to mandate UTF-32 everywhere, you
have to handle variable-length encodings such as Shift-JIS (where
"handle" should usually mean "call library code that understands them").

Variable-length encodings are only one painful aspect of Unicode.
The writing systems humans create are full of strange stuff (e.g.,
uppercase/lowercase, fi ligatures), and our languages and programs
have to deal with that stuff ... or at least have libraries which
can deal with that stuff for us.

When writing code that manipulates ASCII/Latin-n/... text, it is
often just as easy to manipulate characters directly as to call
library routines. When writing code that manipulates Unicode text,
the only rational choice is to use library routines.

Cheers -- Chris[topher] Chittleborough
