Don't misinterpret my words or arguments here: the purpose of the question was strictly about which UTF or other transformation would be good for interoperability and storage, and whether it would be a good idea to encode words with standard codes.

So in my view, it is completely unneeded to create such "standard" codes for common words, if these words are in a natural human language. (It may make sense for computer languages, but this is specific to the implementation of each language, and should be part of its specification rather than being standardized in a general-purpose encoding like Unicode code points, which must also fit all the needs of the representation of human languages, which are NOT standardized and are constantly evolving.) Creating such standard codes for human words would not only be an endless task, but also a work that would rapidly become obsolete, because it could not track the highly variable uses of human languages. Let's keep Unicode simple, without attempting to encode words (even for Chinese, we encode "ideographic" characters, but not words, which are often made of two characters, each representing a single syllable).

If you want to encode words, you create an encoding based on a pictographic representation of human languages, and you are heading the opposite way from the long history of evolution followed by the inventors of script systems. You would be returning to the first ages of humanity, when people had great difficulty understanding each other and transmitting their acquired knowledge.

This does not exclude using another UTF representation to implement algorithms, as an intermediate form that eases processing. However, you are not required to create an actual instance of the other UTF to work with it, and there are many examples where you can work perfectly well with a compact representation that fits marvelously in memory with excellent performance, and where the decompressed form is only used locally.

In *many* cases, notably if the text data to manage is large, adding an object representation with just an API to access a temporary decompressed form will improve the global performance of the system, due to reduced internal processing resource needs. The code that decompresses SCSU to UTF-32 can fit in less than 1KB of memory, but it will let you save as many megabytes of memory as you wish for your large database, given that SCSU takes an average of nearly one byte per character (or code point) instead of 4 with UTF-32.
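To illustrate, here is a deliberately minimal Python sketch of such a wrapper, not a conforming UTS #6 decoder: it handles only single-byte mode with the eight predefined dynamic windows, the SC0..SC7 window-switch tags and the SQU quote tag, and assumes the input uses nothing else.

    # Initial dynamic window offsets from UTS #6 (W0 = Latin-1 supplement, etc.)
    INITIAL_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]

    class ScsuText:
        """Keeps text in its compact SCSU form; decodes only on demand."""
        def __init__(self, scsu_bytes: bytes):
            self._data = scsu_bytes              # ~1 byte/character stays resident

        def codepoints(self):
            """Lazily yield code points, never materializing a UTF-32 copy."""
            window, i, data = 0, 0, self._data
            while i < len(data):
                b = data[i]
                if 0x10 <= b <= 0x17:            # SC0..SC7: switch active window
                    window = b - 0x10
                    i += 1
                elif b == 0x0E:                  # SQU: next two bytes are a BMP code point
                    yield (data[i + 1] << 8) | data[i + 2]
                    i += 3
                elif b >= 0x80:                  # high byte: index into the active window
                    yield INITIAL_WINDOWS[window] + (b - 0x80)
                    i += 1
                else:                            # ASCII range passes through unchanged
                    yield b
                    i += 1

    text = ScsuText(b"Qu\xe9bec")                # 0xE9 decodes to U+00E9 via window 0
    print("".join(map(chr, text.codepoints()))) # -> Québec

The consumer sees an ordinary stream of code points, while only the compact bytes stay in memory.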

Such examples exist in real-world applications, notably in spelling and grammar correctors, whose performance depends completely on the total size of the information they have in their database, and on the degree to which this information is compressed (to minimize the impact on system resources, which is mostly determined by the quantity of information you can fit into fast memory without "swapping" between fast memory and slow disk storage). The most efficient correctors use very compact forms with very specific compression and indexing schemes, through a transparent class managing the conversion between this compact form and the usual representation of text as a linear stream of characters.
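The actual schemes of such correctors are proprietary, but front coding of a sorted word list gives the flavor (a hypothetical sketch; the class name and the linear-scan lookup are mine, chosen for brevity, where a real corrector would add an index):

    class FrontCodedWordList:
        """Sorted word list stored as (shared-prefix length, suffix) pairs."""
        def __init__(self, words):
            self._entries, prev = [], ""
            for w in sorted(words):
                n = 0                            # length of prefix shared with prev
                while n < min(len(prev), len(w)) and prev[n] == w[n]:
                    n += 1
                self._entries.append((n, w[n:])) # store only the differing suffix
                prev = w

        def __iter__(self):
            """Transparently rebuild the plain words on demand."""
            prev = ""
            for n, suffix in self._entries:
                prev = prev[:n] + suffix
                yield prev

        def __contains__(self, word):
            return any(w == word for w in self)

    words = FrontCodedWordList(["correct", "correction", "corrector", "correctors"])
    print("corrector" in words)                  # -> True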

Other examples exist in some RDBMS implementations, to improve the speed of query processing for large databases or the speed of full-text searches, or in their networking connectors to reduce the bandwidth taken by result sets. The benefit of data compression becomes immediate as soon as the data to process must go through any kind of channel (networking links, file storage, database tables) with lower throughput than fast but expensive or restricted internal processing memory (including memory caches, if we consider data locality).
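As a toy measure of that bandwidth argument (using Python's zlib, i.e. deflate; the row text is invented):

    import zlib

    rows = "\n".join(f"row {i}: status=OK region=EU-WEST" for i in range(1000))
    raw = rows.encode("utf-8")
    packed = zlib.compress(raw)                  # what a connector could put on the wire

    print(len(raw), "bytes raw ->", len(packed), "bytes compressed")
    assert zlib.decompress(packed).decode("utf-8") == rows   # lossless round trip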

From: "D. Starner" <[EMAIL PROTECTED]>
"Philippe Verdy" writes:
Suppose that Unicode encodes the common English words "the", "an", "is", etc... then a protocol
could decide that these words are not important and will filter them.

Drop the part of the sentence before "then". A protocol could delete "the", "an", etc. right
now. In fact, I suspect several library systems do drop "the", etc. right now. Not that this
makes it a good idea, but that's a lousy argument.

If such a library does this based only on the presence of the encoded words, without considering which language the text is written in, that kind of text processing will be seriously inefficient or inaccurate for languages other than the English for which the library was built.
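A sketch of the point (the stop-word lists and language tags here are illustrative, not from any real library system):

    STOP_WORDS = {
        "en": {"the", "an", "a", "is"},
        "fr": {"le", "la", "un", "une", "est"},
    }

    def strip_stop_words(words, lang):
        stops = STOP_WORDS.get(lang, set())      # unknown language: filter nothing
        return [w for w in words if w.lower() not in stops]

    print(strip_stop_words(["The", "cat", "is", "here"], "en"))  # ['cat', 'here']
    print(strip_stop_words(["Le", "chat", "est", "ici"], "en"))  # nothing filtered

Applying the English list to French text filters nothing (or, worse, the wrong words), which is exactly the inaccuracy described above.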


For plain text (which is what Unicode deals with), even the words "an", "the", "is" (and so on) are as important as any other part of the text. Encoding frequent words with a single compact code may be effective for a limited set of applications, but it will not be as effective as a more general compression scheme (deflate, bzip2, and so on), which works best independently of the language, and without needing (when implementing text processing functions) an arbitrarily large dictionary to convert these compact codes back to the associated plain-text words encoded as streams of Unicode-supported characters.
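A quick demonstration that a general scheme needs no word dictionary (again zlib/deflate, with invented sample sentences):

    import zlib

    samples = {
        "English": "the cat is on the mat and the dog is in the garden " * 40,
        "French":  "le chat est sur le tapis et le chien est dans le jardin " * 40,
    }
    for lang, text in samples.items():
        raw = text.encode("utf-8")
        ratio = len(zlib.compress(raw)) / len(raw)
        print(f"{lang}: compressed to {ratio:.0%} of original size")

Deflate discovers the frequent words of each language from the data itself; no standardized word codes are needed.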

