Re: Unicode for words?

2004-12-07 Thread Doug Ewell
Richard Cook wrote: > Well, why stop with words, my lord? Why not just encode all sentences, > paragraphs, pages, chapters, books, libraries, or your higher level > unit of choice, for that matter. > ... > Whether you choose to associate a single glyph with your private-use > code point, or an en

Re: Unicode for words?

2004-12-07 Thread Richard Cook
On Dec 5, 2004, at 07:02 PM, Doug Ewell wrote: A word-based encoding for English could automatically assume spaces where they are appropriate. The sentence: "What means this, my lord?" would have seven encodable elements: the five words, the comma, and the question mark. Spaces would be automatic
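The count of seven encodable elements in the quoted example can be checked with a small sketch. The function name `encodable_elements` and the tokenization rule (runs of letters are words, every other non-space character is its own element) are assumptions made for illustration, not anything proposed in the thread.

```python
import re

def encodable_elements(sentence):
    """Split a sentence into words and punctuation marks, dropping spaces.

    Hypothetical rule: a run of letters/apostrophes is one word; any other
    non-space character counts as a separate element.
    """
    return re.findall(r"[A-Za-z']+|[^\sA-Za-z']", sentence)

elements = encodable_elements("What means this, my lord?")
print(elements)       # five words, a comma, and a question mark
print(len(elements))  # 7
```

Spaces never appear in the output, which matches the idea that a decoder would reinsert them automatically.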

Re: Unicode for words?

2004-12-05 Thread Doug Ewell
Hohberger, Clive wrote: > When I went back and recoded those same words with leading or trailing > spaces (denoted here by "_") as: "_the", "the_" "_and", "and_", etc. > as single bytes, I found a huge gain in efficiency in terms of the > number of bytes to encode the same English text. Next, wh

Re: Unicode for words?

2004-12-05 Thread John D. Burger
So here is the idea: why not use the unused part (2^31 - 2^21 = 2,145,386,496) to encode all the words of all the languages as well. You could then send any word with a few bytes. This would reduce the bandwidth necessary to send text. (You need at most six bytes to address all 2^31 code points, and
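The arithmetic in the quoted proposal can be verified with a short sketch. It assumes the original 31-bit UTF-8 scheme, in which a sequence may be up to six bytes long; the function name `utf8_31bit_length` is made up for this illustration.

```python
def utf8_31bit_length(code_point):
    """Bytes needed for a code point under the original 31-bit UTF-8 scheme."""
    # Upper limits (exclusive) for 1- to 6-byte sequences.
    limits = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for length, limit in enumerate(limits, start=1):
        if code_point < limit:
            return length
    raise ValueError("code point does not fit in 31 bits")

unused = 2**31 - 2**21
print(f"{unused:,}")                     # 2,145,386,496
print(utf8_31bit_length(2**21))          # first value past the 4-byte range
print(utf8_31bit_length(2**31 - 1))      # largest 31-bit value
```

Under this assumption every value in the proposed range costs five or six bytes, since four-byte sequences stop just below 2^21.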

Re: Unicode for words?

2004-12-05 Thread D. Starner
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > > Drop the part of the sentence before "then". A protocol could delete "the", > > "an", etc. right > > now. In fact, I suspect several library systems do drop "the", etc. right > > now. Not that this > > makes it a good idea, but that's a lousy argu

Re: Unicode for words?

2004-12-05 Thread Philippe Verdy
Don't misinterpret my words or arguments here: the purpose of the question was strictly about which UTF or other transformation would be good for interoperability, and storage, and whether it would be a good idea to encode words with standard codes. So in my view, it is completely unneeded to cr

RE: Unicode for words?

2004-12-05 Thread Hohberger, Clive
s represented as alphabetic strings. Clive Hohberger -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of D. Starner Sent: Sunday, December 05, 2004 11:49 AM To: [EMAIL PROTECTED] Subject: Re: Unicode for words? "Philippe Verdy" writes: > Suppose that Uni

Re: Unicode for words?

2004-12-05 Thread D. Starner
"Philippe Verdy" writes: > Suppose that Unicode encodes the common English words "the", "an", "is", > etc... then a protocol > could decide that these words are not important and will filter them. Drop the part of the sentence before "then". A protocol could delete "the", "an", etc. right now

Re: Unicode for words?

2004-12-05 Thread Philippe Verdy
From: "Ray Mullan" <[EMAIL PROTECTED]> I don't see how the one million available codepoints in the Unicode Standard could possibly accommodate a grammatically accurate vocabulary of all the world's languages. You have misread the message from Tim: he wanted to use "code points" above U+10FFFF wi

Re: Unicode for words?

2004-12-05 Thread Ray Mullan
I don't see how the one million available codepoints in the Unicode Standard could possibly accommodate a grammatically accurate vocabulary of all the world's languages. You're overlooking the question of which versions of words -- 'color' or 'colour' in English for instance -- would be used in

Re: Unicode for words?

2004-12-05 Thread Richard Cook
On Dec 5, 2004, at 12:27 AM, Tim Finney wrote: my co-worker suggested encoding entire words in Unicode. The "word" is considerably less well-defined than the character. The set of words is open-ended. If you'd like to see where you go when you start trying to encode words, take a look at CJK Exte

Re: Unicode for words?

2004-12-05 Thread D. Starner
"Tim Finney" <[EMAIL PROTECTED]> writes: > This would reduce the > bandwidth necessary to send text. Would it really? Ignoring all the other details (being limited to English, for one), would words that might take up to six bytes in UTF-8 really compete with the normal encoding, with most words ta
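The bandwidth question can be made concrete with a rough sketch. It assumes a hypothetical scheme in which every word costs a fixed five or six bytes (the sizes that code points above 2^21 would need in 31-bit UTF-8), punctuation stays one byte each, and spaces are implied. The sample sentence is borrowed from elsewhere in the thread, and `word_coded_size` is an illustrative name.

```python
import re

def word_coded_size(sentence, bytes_per_word):
    """Estimate the size if each word is one fixed-width code.

    Assumptions: punctuation remains one byte per mark, spaces are implied.
    """
    words = re.findall(r"[A-Za-z']+", sentence)
    marks = re.findall(r"[^\sA-Za-z']", sentence)
    return len(words) * bytes_per_word + len(marks)

sentence = "What means this, my lord?"
print(len(sentence.encode("utf-8")))   # plain UTF-8: 25 bytes
print(word_coded_size(sentence, 5))    # 27 bytes
print(word_coded_size(sentence, 6))    # 32 bytes
```

For short English words like these, the per-word codes come out no smaller than the plain text, which is the doubt raised in the message.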

Unicode for words?

2004-12-05 Thread Tim Finney
Dear All This is off topic, so feel free to ignore it. The other day I was telling a co-worker about Unicode and how the UTF-8 encoding system works. During the far ranging discussions that followed (we are public servants), my co-worker suggested encoding entire words in Unicode. This sounds like