----- Original Message ----- From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Sunday, December 05, 2004 1:37 AM
Subject: Re: Nicest UTF



"Philippe Verdy" <[EMAIL PROTECTED]> writes:

>> There's nothing that requires the string storage to use the same
>> "exposed" array,

> The point is that indexing had better be O(1).

SCSU is also O(1) per code point decoded... simply because it keeps an exact equivalence with code points and requires a *fixed* (and small) number of steps to decode each one, but also because the decoder state uses a *fixed* (and small) number of variables for its internal context (unlike more powerful dictionary-based compression algorithms such as Lempel-Ziv-Welch or deflate).
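To make the "fixed and small" point concrete, here is a minimal Java sketch of such a decoder's state. It covers only a subset of UTS #6 (single-byte mode, the SC0-SC7 window-change tags and the SQU quote tag; all other tag bytes are omitted), and the names are mine, not from any existing library:

    public class ScsuSketch {
        // Initial dynamic-window positions, per UTS #6.
        private final int[] window = {
            0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00
        };
        private int active = 0; // index of the active dynamic window

        // Decodes single-byte-mode input, appending code points to out.
        // The whole decoder context is the array above plus one index:
        // a fixed, small number of variables, and fixed work per byte.
        public void decode(byte[] in, StringBuilder out) {
            for (int i = 0; i < in.length; i++) {
                int b = in[i] & 0xFF;
                if (b >= 0x10 && b <= 0x17) {       // SC0..SC7: switch window
                    active = b - 0x10;
                } else if (b == 0x0E) {             // SQU: next two bytes form a BMP code unit
                    out.append((char) (((in[++i] & 0xFF) << 8) | (in[++i] & 0xFF)));
                } else if (b < 0x80) {              // ASCII range passes through
                    out.append((char) b);           // (other tag bytes not handled here)
                } else {                            // high half maps through the active window
                    out.appendCodePoint(window[active] + (b - 0x80));
                }
            }
        }
    }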


> Not having a constant size per code point requires one of three things:
> 1. Using opaque iterators instead of integer indices.
> 2. Exposing a different unit in the API.
> 3. Living with the fact that indexing is not O(1) in general; perhaps
>    with clever caching it's good enough in common cases.
>
> Although all three choices can work, I would prefer to avoid them.
> If I had to, I would probably choose 1. But for now I've chosen a
> representation based on code points.
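For illustration, option 1 could look like the following minimal sketch: a hypothetical UTF-8-backed string whose positions are opaque cursors rather than integer indices (all names here are mine, invented for the example):

    // Sketch of option 1: positions are opaque cursors, never integers,
    // so the variable-width backing store is not exposed to callers.
    public final class U8String {
        private final byte[] utf8; // assumed well-formed UTF-8

        public U8String(byte[] utf8) { this.utf8 = utf8; }

        // Opaque position: callers can advance and read it,
        // but cannot do arithmetic on it.
        public final class Cursor {
            private int pos = 0; // byte offset, hidden from callers

            public boolean hasNext() { return pos < utf8.length; }

            // Decodes the next code point in a constant number of steps.
            public int next() {
                int b = utf8[pos] & 0xFF;
                int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
                int cp = (len == 1) ? b : b & (0xFF >> (len + 1));
                for (int k = 1; k < len; k++)
                    cp = (cp << 6) | (utf8[pos + k] & 0x3F);
                pos += len;
                return cp;
            }
        }

        public Cursor cursor() { return new Cursor(); }
    }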

>> Anyway, each time you use an index to access some component of a
>> String, the returned value is not an immutable String, but a mutable
>> character or code unit or code point, from which you can build
>> *other* immutable Strings

> No, individual characters are immutable in almost every language.
But individual characters do not always carry a meaning of their own. For natural languages, the relevant unit is almost always the grapheme cluster, not the character (and so not its code point...). Because grapheme clusters have variable lengths, an algorithm that can only work with fixed-width units will not work internationally, or will cause serious problems for the correct analysis or transformation of real languages.
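As a small illustration, java.text.BreakIterator already walks a string by user-perceived characters rather than by code points; note how the combining accent stays attached to its base:

    import java.text.BreakIterator;

    public class Graphemes {
        public static void main(String[] args) {
            String s = "e\u0301b"; // 'e' + combining acute + 'b': 3 code points
            BreakIterator bi = BreakIterator.getCharacterInstance();
            bi.setText(s);
            int start = bi.first();
            for (int end = bi.next(); end != BreakIterator.DONE;
                 start = end, end = bi.next()) {
                // Prints two clusters: "e" with U+0301 together, then "b".
                System.out.println("cluster: " + s.substring(start, end));
            }
        }
    }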

> Assignment to a character variable can be thought of as changing the
> reference to point to a different character object, even if it's
> physically implemented by overwriting the raw character code.

>> When you do that, the returned character or code unit or code point
>> does not guarantee that you'll build valid Unicode strings. In fact,
>> such a character-level interface is not enough to work with and
>> transform Strings (for example, it cannot perform correct
>> transformations of lettercase, or manage grapheme clusters).

> This is a different issue. Indeed transformations like case mapping
> work in terms of strings, but in order to implement them you must
> split a string into some units of bounded size (code points, bytes,
> etc.).

Yes, but why must this intermediate unit be the code point? Such algorithms can be developed with any UTF, or even with compressed encoding schemes, through accessor or enumerator methods...
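A concrete example of the lettercase problem mentioned above, using the German sharp s (plain Java, nothing assumed beyond the standard library):

    import java.util.Locale;

    public class CaseDemo {
        public static void main(String[] args) {
            String s = "stra\u00DFe"; // "strasse" written with U+00DF
            // String-level case mapping may change the length:
            // U+00DF uppercases to the two letters "SS".
            System.out.println(s.toUpperCase(Locale.GERMAN)); // STRASSE
            // A character-by-character loop cannot produce that result:
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < s.length(); i++)
                sb.append(Character.toUpperCase(s.charAt(i)));
            // U+00DF is left unchanged: it has no single-char uppercase.
            System.out.println(sb);
        }
    }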


> All non-trivial string algorithms boil down to working on individual
> units, because conditionals and dispatch tables must be driven by
> finite sets. Any unit of a bounded size is technically workable, but
> they are not equally convenient. Most algorithms are specified in
> terms of code points, so I chose code points for the basic unit in
> the API.

"Most" is the right term here: this is not a requirement, and it's not because it is the simplest way to implement such algorithm that it will be the most efficient in terms of performance or resource allocations. Most experiences prove that the most efficient algorithms are also complex to implement.


Code points are probably the easiest terms in which to describe what a text algorithm is supposed to do, but they are not a requirement for applications. In fact, many libraries have been written that correctly implement the Unicode algorithms without ever dealing with code points, working only with in-memory code units of UTF-16 (or even UTF-8 or GB18030), or directly with the serialization bytes of UTF-16LE, UTF-8, SCSU, or other encoding schemes.
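As one small example of working directly on serialization bytes: counting the code points of well-formed UTF-8 requires no decoding at all, since every byte except the 10xxxxxx continuation bytes starts a new code point (a sketch under that well-formedness assumption):

    public class U8Count {
        // Counts code points in well-formed UTF-8 without decoding it.
        public static int codePointCount(byte[] utf8) {
            int n = 0;
            for (byte b : utf8)
                if ((b & 0xC0) != 0x80) n++; // skip continuation bytes
            return n;
        }
    }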

Which representation is best is left to implementers, but I really think that compressed schemes are often introduced to improve application performance and reduce the resources needed, both in memory and for I/O, and also in networking, where interoperability across systems and bandwidth optimization are important design goals...




