In a message dated 2001-11-13 4:22:59 Pacific Standard Time, [EMAIL PROTECTED] writes:
>> However, your argument that it is important to reduce the *average* length
>> of encoded names, certainly doesn't apply to UTF-8 (even if it's accepted
>> that it applies to ACE, which I don't accept).
>
> Yes,
> That argument is just about justifying adding hangul syllable code block
> in addition to hangul jamo (alphabet) block : 9 octets -> 3 octets
> "compaction".
>
>> Users will never see (much less type in) UTF-8 octet string encodings
>> except in obscure debugging situations.
>
> But, without hangul syllable block, users will suffer from
> 3 times more resource consumption for a unicode hangul syllable.
> 6 hangul syllables ( 6 * 3 * 3 = 54 octets ) are allowed within utf8 63
> octets limit !!!!

I missed Soobok's point earlier, that he was talking about inefficient representation of jamos in UTF-8. Of course, Hangul expressed in this way does carry a significant UTF-8 performance penalty, just like other alphabetic scripts in ranges above U+0800 (including all the Indic scripts, Thai, Lao, Georgian, and kana).

I have been carefully avoiding the UTF-8 vs. ACE debate, and have no intention of entering it now. Both approaches have advantages and disadvantages, and certainly one disadvantage of the UTF-8 approach is its non-optimal compaction of such scripts. However, this is not a shortcoming of UTF-8 in general, just of its use in this specific situation where space is at a premium.

Remember that the original design goal of UTF-8 (as specified by Ken Thompson in 1992) simply stated that "the transformation format should not be extravagant in terms of number of bytes used for encoding." It would be difficult indeed to claim that this goal has not been met.

The solution for better compaction of Hangul would appear to be to allow precomposed syllables, not merely jamos.

That said, I still feel it is unproductive to claim that Hangul is "disadvantaged" or "disfavored" by UTF-8 and/or ACE as though it were the result of some kind of linguistic apartheid.

ASCII, for better or worse, makes up and will continue to make up the lion's share of encoded text. Other small alphabetic scripts in common use, such as Greek, Cyrillic, Arabic, and Hebrew, were assigned codes that put them in the two-byte range of UTF-8. Small alphabetic scripts compress more easily than large syllabaries or logographic scripts. These are engineering facts, not political decisions.

-Doug Ewell
 Fullerton, California
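
For anyone who wants to check the octet counts discussed in this message, here is a minimal Python sketch. It is only an illustration, not part of any proposal: the helper name utf8_len, the sample characters, and the choice of U+D55C as the test syllable are assumptions made for this example. It relies on the fact that Unicode normalization form NFD decomposes a precomposed Hangul syllable into its conjoining jamos. Note that the 9-octet figure applies to syllables with a trailing consonant; a syllable without one decomposes into two jamos, or 6 octets.

```python
import unicodedata


def utf8_len(text: str) -> int:
    """Return the number of octets in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# One sample character per script (the sample choices are illustrative).
samples = {
    "ASCII a": "\u0061",
    "Greek alpha": "\u03B1",
    "Cyrillic a": "\u0430",
    "Hebrew alef": "\u05D0",
    "Arabic alef": "\u0627",
    "Devanagari ka": "\u0915",
    "Thai ko kai": "\u0E01",
    "Georgian an": "\u10D0",
    "Hiragana a": "\u3042",
    "Hangul syllable han": "\uD55C",
}
octets_per_char = {name: utf8_len(ch) for name, ch in samples.items()}

# A precomposed syllable versus the same syllable spelled out in jamos.
syllable = "\uD55C"                              # precomposed, U+D55C
jamos = unicodedata.normalize("NFD", syllable)   # U+1112 U+1161 U+11AB

# The six-syllable case from the quoted message, in both forms.
label_precomposed = syllable * 6
label_jamos = unicodedata.normalize("NFD", label_precomposed)

for name, count in octets_per_char.items():
    print(f"{name}: {count} octet(s)")
print("one syllable, precomposed:", utf8_len(syllable))           # 3
print("one syllable, as jamos:", utf8_len(jamos))                 # 9
print("six syllables, precomposed:", utf8_len(label_precomposed)) # 18
print("six syllables, as jamos:", utf8_len(label_jamos))          # 54
```

The per-script counts come out as 1 octet for ASCII, 2 for Greek, Cyrillic, Hebrew, and Arabic, and 3 for everything at or above U+0800, which is the boundary described above.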
