> Message of 04/06/10 18:30
> From: "Doug Ewell" <d...@ewellic.org>
> To: "Mark Davis ☕" <m...@macchiato.com>
> Cc: unicode@unicode.org, "Otto Stolz" <otto.st...@uni-konstanz.de>
> Subject: RE: Least used parts of BMP.
>
> Mark Davis ☕ <mark at macchiato dot com> replied to Otto Stolz <Otto
> dot Stolz at uni dash konstanz dot de>:
>
> >> The problem with this encoding is that the trailing bytes
> >> are not clearly marked: they may start with any of
> >> '0', '10', or '110'; only '111' would mark a byte
> >> unambiguously as a trailing one.
> >>
> >> In contrast, in UTF-8 every single byte carries a marker
> >> that unambiguously marks it as either a single ASCII byte,
> >> a starting, or a continuation byte; hence you have not to
> >> go back to the beginning of the whole data stream to recognize,
> >> and decode, a group of bytes.
> >
> > In a compression format, that doesn't matter; you can't expect random
> > access, nor many of the other features of UTF-8.
>
> That said, if Kannan were to go with the alternative format suggested on
> this list:
>
> 0xxxxxxx
> 1xxxxxxx 0yyyyyyy
> 1xxxxxxx 1yyyyyyy 0zzzzzzz
>
> then he would at least have this one feature of UTF-8, at no additional
> cost in bits compared to the format he is using today.
>
> Of course, he will not have other UTF-8-like features, such as avoidance
> of ASCII values in the final trail byte, and "fast forward parsing" by
> looking at the first byte.
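For concreteness, a forward decoder for that suggested format could look roughly like this (an illustrative sketch only, not code from the thread; the helper name decode_one is mine, and it assumes the first byte holds the most significant bits):

    #include <stdint.h>
    #include <stddef.h>

    /* Each byte carries 7 payload bits; a set high bit means
       "another byte follows", a clear high bit ends the sequence. */
    size_t decode_one(const uint8_t *p, uint32_t *cp)
    {
        uint32_t value = 0;
        size_t n = 0;
        do {
            value = (value << 7) | (uint32_t)(p[n] & 0x7F);
        } while (p[n++] & 0x80);
        *cp = value;
        return n;
    }

Note that a byte with a clear high bit could be either a standalone character or the final byte of a longer sequence, which is why this format can only be parsed forward.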
The fast forward feature is certainly not decisive, but random accessibility (from any position and in any direction) is much more decisive and is a real positive factor for UTF-8, compared to the format proposed above, which can only be read in the forward direction, even if it can be accessed randomly to find the *next* character. To find the *previous* one, you have to scan backward until you eat at least one byte used to encode the character before it (otherwise, you don't know whether a 1xxxxxxx byte is the first one in a sequence, even if you can tell whether a byte is the last one).

> He may not care. One thing I've noted about
> descriptions of UTF-8, in the context of alternative formats for private
> protocols, is that they always assume these features are important to
> everyone, when they may not be.

One decisive factor that has favored UTF-8 is that it is fully compatible with ASCII: all ASCII values are used exclusively as single bytes encoding ASCII characters, never as trail bytes. This is what makes UTF-8 compatible with MIME and lets it work exactly like the ISO 8859-* series (including when it was converted bijectively between ISO-8859-1 and extended EBCDIC, simply ignoring which exact ISO 8859 code page was used). One consequence is that characters are preserved even if line wraps need to be changed (the C0 controls and SPACE are unaffected, and the characters essential to many protocols are preserved, including digits, some punctuation, and the lettercase mappings in the ASCII subspace).

Another working encoding that would preserve MIME compatibility, ASCII, and bidirectional random access would be:

- 0zzzzzzz : encodes all 2^7 code points, from U+0000 to U+007F (with subtracted offset = 0)
- 11yyyyyy 10zzzzzz : encodes all 2^12 code points, from U+0080 to U+107F (with subtracted offset = 0x0080)
- 11xxxxxx 11yyyyyy 10zzzzzz : encodes all 2^18 code points, from U+1080 to U+04107F (with subtracted offset = 0x1080)
- 11****vv 11xxxxxx 11yyyyyy 10zzzzzz (where * is an unused bit) : encodes about 81% at the start of 2^20 theoretical code points, from U+041080 to U+10FFFF (with subtracted offset = 0x41080)

(A rough encoder sketch for this scheme follows below.)

It would be a little more compact than UTF-8 for a larger subset of Unicode code points, and it could even be compacted further than UTF-8, because no sequence length overlaps the code space of the next one (as shown above), or by also compacting away the surrogate code point space. You could also decide to drop compatibility with the binary ordering of code points. But all these "improvements" will make very little difference (compared to existing UTF-8) in terms of compression for the Latin, Greek, Cyrillic, Semitic, and Indic scripts, and not even for ideographs, most of which won't fit in the shorter sequences.
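As an illustration of that scheme (my sketch only; the helper name encode_one is hypothetical), an encoder would just subtract the offset for each range and spread the payload bits over the lead and trail bytes:

    #include <stdint.h>
    #include <stddef.h>

    /* Byte roles in this scheme: 0xxxxxxx = single ASCII byte,
       11xxxxxx = non-final byte of a sequence,
       10xxxxxx = final byte of a multi-byte sequence. */
    size_t encode_one(uint32_t cp, uint8_t out[4])
    {
        if (cp <= 0x7F) {                        /* 1 byte, offset 0        */
            out[0] = (uint8_t)cp;
            return 1;
        } else if (cp <= 0x107F) {               /* 2 bytes, offset 0x80    */
            uint32_t v = cp - 0x80;              /* 12 payload bits         */
            out[0] = (uint8_t)(0xC0 | (v >> 6));
            out[1] = (uint8_t)(0x80 | (v & 0x3F));
            return 2;
        } else if (cp <= 0x4107F) {              /* 3 bytes, offset 0x1080  */
            uint32_t v = cp - 0x1080;            /* 18 payload bits         */
            out[0] = (uint8_t)(0xC0 | (v >> 12));
            out[1] = (uint8_t)(0xC0 | ((v >> 6) & 0x3F));
            out[2] = (uint8_t)(0x80 | (v & 0x3F));
            return 3;
        } else {                                 /* 4 bytes, offset 0x41080 */
            uint32_t v = cp - 0x41080;           /* up to 20 payload bits   */
            out[0] = (uint8_t)(0xC0 | (v >> 18));
            out[1] = (uint8_t)(0xC0 | ((v >> 12) & 0x3F));
            out[2] = (uint8_t)(0xC0 | ((v >> 6) & 0x3F));
            out[3] = (uint8_t)(0x80 | (v & 0x3F));
            return 4;
        }
    }

The bidirectional random access comes from the byte roles: from any byte you scan backward until you hit a byte with a leading 0 or 10 prefix; that byte ends the previous character, and the byte after it starts the current one.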
You could also decide to "compress" the encoding space used by Korean, by transforming the large block of precomposed syllables into their canonically equivalent jamos (possibly with an internal separator/disambiguator that exists only in the encoded form and is not mapped to any Unicode character/code point, where needed to unify the encoding of leading and trailing consonants), so that Korean would be encoded as a simple alphabet in a very small block; you could place this small block within the unused code space of the surrogates (see the decomposition sketch at the end of this message). You could also reorder the various blocks (notably hiragana, katakana, bopomofo, and the South-East Asian scripts) so that they fall in the shorter sequences, while encoding the less used characters (geometric symbols, block/line drawing, maths symbols, dingbats) at higher positions.

You can invent many variants like this, according to your language needs, and then you'll have reinvented the various national Asian charsets that are compatible with MIME and can be made conforming to Unicode (like GB18030 used in P.R. China)... So do you really want to do that? Maybe encoding Unicode with GB18030 (or the newest versions of the HKSCS, KSC and JIS character encoding standards) is your immediate solution, and there's nothing to redevelop now, as it is already implemented and widely available as an alternative to UTF-8...

But if your need is to support some non-major scripts (like Georgian), consider that supporting these encoders will cost you more than simply using UTF-8, which is now supported everywhere. The cost of the extra storage space for these scripts is much less than the cost of adapting and maintaining systems that support these specialized encodings (even if they are made compatible with and conforming to Unicode, so that they can represent all valid code points and preserve, at least, all the canonical equivalences). Today, storage and even transmission are no longer a problem: generic compression algorithms already work perfectly when used on top of an internal UTF-8 encoding for storage and networking, with UTF-16 or UTF-32 in memory for local processing of small amounts of text, up to several megabytes.
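For reference, the Korean "compression" idea above would rely on the standard arithmetic Hangul decomposition defined in the Unicode Standard (chapter 3); a minimal sketch (the function name decompose_hangul is mine):

    #include <stdint.h>

    /* Standard decomposition of a precomposed Hangul syllable
       (U+AC00..U+D7A3) into 2 or 3 conjoining jamos. */
    enum {
        SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
        VCount = 21, TCount = 28
    };

    /* Writes the jamos for syllable s into out[]; returns 2 or 3,
       or 0 if s is not a precomposed syllable. */
    int decompose_hangul(uint32_t s, uint32_t out[3])
    {
        if (s < SBase || s > 0xD7A3)
            return 0;
        uint32_t index = s - SBase;
        out[0] = LBase + index / (VCount * TCount);             /* leading consonant  */
        out[1] = VBase + (index % (VCount * TCount)) / TCount;  /* vowel              */
        if (index % TCount == 0)
            return 2;                                           /* no trailing consonant */
        out[2] = TBase + index % TCount;                        /* trailing consonant */
        return 3;
    }

Since every precomposed syllable maps arithmetically to at most three jamos, the whole Hangul Syllables block can indeed be folded into a very small alphabetic block in such a custom encoding.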