From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
Now consider scanning forwards. We want to strip the beginning of a
string. For example, the string is an IRC message prefixed with a
command, and we want to take only the message for further processing.
We have found the end of the prefix and we want to produce a string
from this position to the end (a copy, since strings are immutable).
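As a concrete illustration of the operation described in the quoted text, here is a minimal Python sketch (the function name and message format are hypothetical; the original thread contains no code): scan forward to the end of the command prefix, then take a copy of everything after it.

```python
def strip_command(line: str) -> str:
    # Hypothetical example: "PRIVMSG #channel :hello there" -> "hello there".
    # Scan forward to the separator that ends the command prefix (here " :"),
    # then build a new string from that position to the end. Since strings
    # are immutable, the slice is an independent copy of the suffix.
    pos = line.find(" :")
    if pos == -1:
        return line              # no trailing part; keep the whole line
    return line[pos + 2:]        # copy from just past the separator to the end

print(strip_command("PRIVMSG #channel :hello there"))  # -> "hello there"
```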

None of those examples is a demonstration: decoding IRC commands and similar tasks does not establish a need to encode large sets of texts. Your examples show applications that locally handle a few strings made for computer languages.


Texts in human languages, or even collections of person names or place names, are not like this: they show a much wider variety, but also offer huge possibilities for data compression (inherent to the phonology of human languages and their overall structure, but also due to the repetitive conventions spread throughout a text to make it easier to read and understand).

Scanning a person name or a human-language text backward may occasionally be needed locally, but such text has a strong forward directionality without which it does not make sense. The same goes for scanning it from random positions: extracting random fragments like this can easily produce false interpretations of the text.

Anyway, if you have a large database of texts to process or even to index, you will, in the end, need to scan the text linearly from beginning to end, if only to build an index for accessing it randomly later. You will still need to store the indexed text somewhere, and to maximize the performance or responsiveness of your application you will need to minimize its storage: that is where compression comes in. Compression does not change or remove the semantics of the text; it is simply an optimization, and it does not prevent later access through a more easily parsable representation as stateless streams of characters, via surjective (sometimes bijective) converters between the compressed and uncompressed forms.
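As a rough illustration of this point, here is a minimal Python sketch using zlib (the function names are hypothetical; the original post contains no code): the text is stored compressed, and a converter exposes it again as a plain stream of characters, leaving the semantics untouched.

```python
import zlib

def store(text: str) -> bytes:
    # Compress the UTF-8 form for storage; this changes only the
    # representation, not the semantics of the text.
    return zlib.compress(text.encode("utf-8"))

def characters(blob: bytes):
    # Converter back to an uncompressed, easily parsable stream:
    # decompress, decode, and yield one character at a time.
    for ch in zlib.decompress(blob).decode("utf-8"):
        yield ch

blob = store("Texts compress well: repetitive conventions, repetitive conventions.")
print(len(blob))                    # compressed size in bytes
print("".join(characters(blob)))    # identical to the original text
```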

My conclusion: there is no "best" representation that fits all needs. Each representation has its merits in its own domain. The Unicode UTFs are excellent for local processing of limited texts, but they are not necessarily the best choice for long-term storage or for large text sets.

And even for texts that will be accessed frequently, compressed schemes can still be an optimization, even if those texts must be decompressed each time they are needed. I am clearly against the "one scheme fits all needs" argument, even if you think that UTF-32 is the only viable long-term solution.



