Dear All
This is off topic, so feel free to ignore it.
The other day I was telling a co-worker about Unicode and how the UTF-8 encoding works. During the far-ranging discussion that followed (we are public servants), my co-worker suggested encoding entire words in Unicode.
This sounds like heresy to all of us who know that Unicode is meant only for characters. But wait a minute... Aren't there a whole lot of code points that will never be used? 2^31 is a big number. I imagine that it could contain all of the words of all of the languages as well as all of their characters. According to Markus Kuhn's Unicode FAQ (http://www.cl.cam.ac.uk/~mgk25/unicode.html), "Current plans are that there will never be characters assigned outside the 21-bit code space from 0x000000 to 0x10FFFF, which covers a bit over one million potential future characters".
So here is the idea: why not use the unused part (2^31 - 2^21 = 2,145,386,496 code points) to encode all the words of all the languages as well? You could then send any word with a few bytes. This would reduce the bandwidth necessary to send text. (You need at most six bytes to address all 2^31 code points, and with a knowledge of word frequencies you could assign the most frequently used words to code points that require fewer bytes.) Whether text represents a significant proportion of bandwidth use is an important question, but because bandwidth = money, this idea could save quite a lot, even if text only represents a small proportion of the total bandwidth. Phone companies could use encoded words for transmitting SMS messages, thereby saving money on new mobile tower installations, although they are going to put in 3G (video-capable) anyway.
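To check the arithmetic, here is a rough Python sketch. It assumes the original six-byte UTF-8 layout (the one in RFC 2279; the later RFC 3629 restricts UTF-8 to four bytes), and `utf8_len` is a helper name I made up for illustration, not a library function. One thing it shows is that every code point at or above 2^21 needs five or six bytes in that layout, so the saving would only apply to words longer than that.

```python
# Sketch only: byte lengths under the original six-byte UTF-8 layout.
# utf8_len is a hypothetical helper, not part of any library.

def utf8_len(code_point: int) -> int:
    """Return how many bytes the six-byte UTF-8 layout needs for code_point."""
    if not 0 <= code_point < 2**31:
        raise ValueError("code point outside the 31-bit space")
    # Exclusive upper bounds for sequences of 1, 2, 3, 4, 5 and 6 bytes.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for n_bytes, limit in enumerate(limits, start=1):
        if code_point < limit:
            return n_bytes
    raise AssertionError("unreachable")

# Code points left over above the 21-bit space.
unused = 2**31 - 2**21

# Of those, the ones that fit in a five-byte sequence.
five_byte_slots = 2**26 - 2**21

print(f"unused code points: {unused:,}")
print(f"five-byte slots:    {five_byte_slots:,}")
print("bytes for 2^21:", utf8_len(2**21))
print("bytes for 2^26:", utf8_len(2**26))
```
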
All of the machinery (Unicode, UTF-8, web crawlers that can work out what words are used most often) is already there.
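As a toy illustration of the frequency idea, the sketch below counts words in a sample text and hands out code points in descending order of frequency, so the commonest words get the lowest numbers. The starting value 0x200000 (the first code point above the 21-bit space) and the name `build_word_table` are my own assumptions; a real scheme would need an agreed word list and a much larger corpus.

```python
from collections import Counter

# Assumed starting point: the first code point above the 21-bit space.
WORD_BASE = 0x200000

def build_word_table(text: str, base: int = WORD_BASE) -> dict[str, int]:
    """Map each word to a code point, most frequent word first."""
    counts = Counter(text.lower().split())
    return {
        word: base + rank
        for rank, (word, _count) in enumerate(counts.most_common())
    }

sample = "the cat and the dog and the bird"
table = build_word_table(sample)

for word, code_point in table.items():
    print(f"{word}: U+{code_point:06X}")
```
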
Someone must have already thought of this? If not, my co-worker, Zack Alach, deserves the kudos.
Best
Tim Finney