Dear All

This is off topic, so feel free to ignore it.

The other day I was telling a co-worker about Unicode and how the UTF-8
encoding system works. During the far-ranging discussions that followed
(we are public servants), my co-worker suggested encoding entire words
in Unicode.

This sounds like heresy to all of us who know that Unicode is meant only
for characters. But wait a minute... Aren't there a whole lot of
code points that will never be used? 2^31 is a big number. I imagine that
it could contain all of the words of all of the languages as well as all
of their characters. According to Markus Kuhn's Unicode FAQ
(http://www.cl.cam.ac.uk/~mgk25/unicode.html), "Current plans are that
there will never be characters assigned outside the 21-bit code space
from 0x000000 to 0x10FFFF, which covers a bit over one million potential
future characters".

So here is the idea: why not use the unused part (2^31 - 2^21 =
2,145,386,496) to encode all the words of all the languages as well? You
could then send any word with a few bytes. This would reduce the
bandwidth necessary to send text. (You need at most six bytes to address
all 2^31 code points, and with knowledge of word frequencies could
assign the most frequently used words to code points that require
smaller numbers of bytes.) Whether text represents a significant
proportion of bandwidth use is an important question, but because
bandwidth = money, this idea could save quite a lot even if text is
only a small fraction of the total. Phone companies
could use encoded words for transmitting SMS messages, thereby saving
money on new mobile tower installations, although they are going to put
in 3G (video-capable) anyway.
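
To make it concrete, here is a rough Python sketch. It uses the original
six-byte UTF-8 scheme from RFC 2279 (the one that covers the full 31-bit
range, before RFC 3629 trimmed UTF-8 back to four bytes) plus a made-up
word table; the word list and the starting code point 0x110000 are just
illustrations, not anything standardised.

    # Rough illustration only: encode an integer code point with the original
    # six-byte UTF-8 scheme (RFC 2279), which covers 0 .. 0x7FFFFFFF.
    def utf8_encode(cp):
        if cp < 0x80:
            return bytes([cp])                        # 1 byte: 0xxxxxxx
        # (bytes needed, leading-byte marker, upper limit) for the longer forms
        for nbytes, marker, limit in [(2, 0xC0, 0x800), (3, 0xE0, 0x10000),
                                      (4, 0xF0, 0x200000), (5, 0xF8, 0x4000000),
                                      (6, 0xFC, 0x80000000)]:
            if cp < limit:
                tail = []
                for _ in range(nbytes - 1):
                    tail.append(0x80 | (cp & 0x3F))   # continuation byte: 10xxxxxx
                    cp >>= 6
                return bytes([marker | cp] + tail[::-1])
        raise ValueError("code point outside the 31-bit range")

    # Made-up word table: most frequent words get the lowest code points,
    # starting just above the 21-bit character space at 0x110000.
    words_by_frequency = ["the", "of", "and", "to", "a"]
    word_code = {w: 0x110000 + i for i, w in enumerate(words_by_frequency)}

    print(utf8_encode(word_code["the"]))   # b'\xf4\x90\x80\x80' -- 4 bytes for "the"
    print(len(utf8_encode(0x7FFFFFFF)))    # 6 -- worst case

Starting the word table at 0x110000 simply keeps clear of the 21-bit
character space; the most frequent words would land in the four-byte
range while the rarest ones spill out to five or six bytes.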

All of the machinery (Unicode, UTF-8, web crawlers that can work out
what words are used most often) is already there.

Surely someone has already thought of this? If not, my co-worker, Zack
Alach, deserves the kudos.

Best

Tim Finney



