Re: Nicest UTF

Philippe Verdy Sat, 04 Dec 2004 07:53:47 -0800

From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>

"Philippe Verdy" <[EMAIL PROTECTED]> writes:

Random access by code point index means that you don't use strings
as immutable objects,


No. Look at Python, Java and C#: their strings are immutable (don't
change in-place) and are indexed by integers (not necessarily by code
points, but it doesn't change the point).

Those strings are not indexed. They are just accessible through methods or accessors that act *as if* they were arrays. Nothing requires the string storage to use the same "exposed" array, and in fact you can just as well work on immutable strings as if they were vectors of code points, or vectors of code units, or sometimes vectors of bytes.

Note for example the difference between the .length property of Java arrays and the .length() method of Java String instances...
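To make that distinction concrete, here is a minimal Java sketch (the class name LengthDemo and its helper methods are only illustrative). It assumes Java 5 or later for codePointCount. An array exposes its length as a field, while a String only answers through methods, and those methods count UTF-16 code units unless you explicitly ask for code points:

```java
public class LengthDemo {
    // "a" followed by U+1D11E MUSICAL SYMBOL G CLEF, which takes a surrogate pair in UTF-16.
    static final String TEXT = "a\uD834\uDD1E";

    static int arrayLength() {
        char[] units = TEXT.toCharArray(); // a copy of the code units, not the String's storage
        return units.length;               // a field of the array object
    }

    static int codeUnitLength() {
        return TEXT.length();              // a method call on the String
    }

    static int codePointLength() {
        return TEXT.codePointCount(0, TEXT.length());
    }

    public static void main(String[] args) {
        System.out.println(arrayLength());     // 3
        System.out.println(codeUnitLength());  // 3
        System.out.println(codePointLength()); // 2
    }
}
```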

Note also that the "conversion" of an array of bytes, code units, or code points to a String requires distinct constructors, and that the storage is copied rather than simply referenced (the main reason being that indexed vectors or arrays are mutable in their indexed content, whereas String instances are not, which is what makes them sharable).
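A small sketch of that copying behaviour, using the three standard java.lang.String constructors for code units, code points, and bytes (the class name CopyDemo and its methods are only illustrative, and StandardCharsets assumes Java 7 or later):

```java
import java.nio.charset.StandardCharsets;

public class CopyDemo {
    static String fromUnitsThenMutate() {
        char[] units = {'a', 'b', 'c'};
        String s = new String(units); // the constructor copies the code units
        units[0] = 'X';               // mutating the array afterwards...
        return s;                     // ...does not affect the String
    }

    static String fromCodePoints() {
        int[] codePoints = {0x61, 0x1D11E};
        return new String(codePoints, 0, codePoints.length);
    }

    static String fromBytes() {
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA9}; // UTF-8 encoding of U+00E9
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(fromUnitsThenMutate());     // abc
        System.out.println(fromCodePoints().length()); // 3 (one code unit plus a surrogate pair)
        System.out.println(fromBytes().length());      // 1
    }
}
```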

Anyway, each time you use an index to access some component of a String, the returned value is not an immutable String but a mutable character, code unit, or code point, from which you can build *other* immutable Strings (using for example mutable StringBuffer or StringBuilder objects, or similar objects in other languages). When you do that, nothing guarantees that the returned characters, code units, or code points will be assembled into valid Unicode strings. In fact, such a character-level interface is not enough to work with and transform Strings (for example, it cannot perform correct lettercase transformations or manage grapheme clusters). The most powerful (and universal) transformations are those that don't use these interfaces directly, but that work on complete Strings and return complete Strings.
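As an illustration of both points, the following Java sketch (class and method names are hypothetical) uppercases a German word one code unit at a time and then as a whole String, and also shows that a StringBuilder will happily produce ill-formed UTF-16:

```java
import java.util.Locale;

public class CaseDemo {
    // Uppercase one UTF-16 code unit at a time, as a character-level API invites.
    static String upperPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toUpperCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Uppercase the whole String; this can apply one-to-many mappings.
    static String upperWhole(String s) {
        return s.toUpperCase(Locale.ROOT);
    }

    // Nothing stops a builder from ending a String with an unpaired surrogate.
    static boolean endsWithLoneSurrogate() {
        String broken = new StringBuilder("a").append('\uD834').toString();
        return Character.isHighSurrogate(broken.charAt(broken.length() - 1));
    }

    public static void main(String[] args) {
        String word = "stra\u00DFe"; // "strasse" written with U+00DF SHARP S
        System.out.println(upperPerChar(word).length()); // 6: the sharp s is left unchanged
        System.out.println(upperWhole(word));            // STRASSE
        System.out.println(endsWithLoneSurrogate());     // true
    }
}
```

The per-character loop cannot produce "STRASSE" because the result has a different length from the input, which is exactly the kind of transformation that only a String-to-String operation can express.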

The character-level APIs are a convenience for very basic legacy transformations, but alone they do not solve most internationalization problems; otherwise they serve as a "protected" interface that allows building more powerful String-to-String transformations.

Once you realize that, which UTF you use to handle immutable String objects is not important, because it becomes part of the "blackbox" implementation of String instances. If you then consider the UTF as a blackbox, the real arguments for one UTF or another depend on the set of String-to-String transformations you want to use (because that set conditions the implementation of these transformations), but more importantly the choice affects the efficiency of String storage allocation.

For this reason, the blackbox can itself determine which UTF or internal encoding is best for performing those transformations: the total volume of immutable String instances to handle in memory and the frequency of their instantiation determine which representation to use (because large String volumes will put pressure on the memory manager and will seriously impact overall application performance).
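To give a rough idea of how much the storage volume can vary, here is a small Java sketch (the class SizeDemo and its methods are hypothetical) that computes the payload size of the same well-formed text in UTF-8, UTF-16, and UTF-32. It ignores object headers and any other per-instance overhead, so treat the numbers as a lower bound only:

```java
import java.nio.charset.StandardCharsets;

public class SizeDemo {
    // Payload bytes only; assumes the input is well-formed UTF-16.
    static int utf8Bytes(String s)  { return s.getBytes(StandardCharsets.UTF_8).length; }
    static int utf16Bytes(String s) { return 2 * s.length(); }
    static int utf32Bytes(String s) { return 4 * s.codePointCount(0, s.length()); }

    public static void main(String[] args) {
        String latin = "hello";
        String cjk = "\u65E5\u672C\u8A9E"; // three CJK ideographs
        System.out.println(utf8Bytes(latin) + " " + utf16Bytes(latin) + " " + utf32Bytes(latin)); // 5 10 20
        System.out.println(utf8Bytes(cjk) + " " + utf16Bytes(cjk) + " " + utf32Bytes(cjk));       // 9 6 12
    }
}
```

Neither UTF wins for both inputs, which is why the choice is better left to the blackbox than fixed once for all strings.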

Using SCSU for such a String blackbox can be a good option if it effectively helps to store many strings in a representation that is compact (for global performance) but still very fast (for transformations).

Unfortunately, the immutable String implementations in Java, C#, or Python do not allow the application designer to decide which representation will be best (they are implemented as concrete classes instead of virtual interfaces with multiple possible implementations, as they should be; the alternative to interfaces would have been class-level methods allowing the application to negotiate tuning parameters with the blackbox class implementation).
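What such an interface-based design might look like can be sketched as follows. Everything here is hypothetical (the Text interface, the two implementations, and the selection rule in of() are invented for illustration and are not part of any real library); the point is only that callers see one immutable abstraction while the storage encoding is chosen behind it:

```java
import java.nio.charset.StandardCharsets;

public class TextDemo {
    /** Hypothetical immutable text abstraction. */
    interface Text {
        int codePointCount();
        int storageBytes();
        String toJavaString();
    }

    /** Stores the text as UTF-8 bytes. */
    static final class Utf8Text implements Text {
        private final byte[] bytes;
        Utf8Text(String s) { bytes = s.getBytes(StandardCharsets.UTF_8); }
        public int codePointCount() {
            int n = 0;
            for (byte b : bytes) {
                if ((b & 0xC0) != 0x80) n++; // count every byte that is not a continuation byte
            }
            return n;
        }
        public int storageBytes() { return bytes.length; }
        public String toJavaString() { return new String(bytes, StandardCharsets.UTF_8); }
    }

    /** Stores the text as UTF-16 code units by wrapping a java.lang.String. */
    static final class Utf16Text implements Text {
        private final String value;
        Utf16Text(String s) { value = s; }
        public int codePointCount() { return value.codePointCount(0, value.length()); }
        public int storageBytes() { return 2 * value.length(); }
        public String toJavaString() { return value; }
    }

    /** The "blackbox" decision, assumed here to be: pick the smaller payload. */
    static Text of(String s) {
        Text utf8 = new Utf8Text(s);
        Text utf16 = new Utf16Text(s);
        return utf8.storageBytes() <= utf16.storageBytes() ? utf8 : utf16;
    }

    public static void main(String[] args) {
        System.out.println(of("hello").getClass().getSimpleName());              // Utf8Text
        System.out.println(of("\u65E5\u672C\u8A9E").getClass().getSimpleName()); // Utf16Text
    }
}
```

A real implementation would also weigh the cost of the transformations to be run on each representation, not just the storage size.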

There are other classes or libraries within which such multiple representations are possible, and easily and transparently convertible from one to the other. (Note that this discussion concerns the UTF used to represent code points, but today there is also a need to work on strings at grapheme cluster boundaries and in the various normalization forms; a few libraries do exist in which the normalization can be changed without changing the "immutable" aspect of Strings, the complexity being that Strings do not always represent plain text...)
