Karl Brodowsky wrote:
Mark J. Reed wrote:
The UTF-8 encoding is not so attractive in locales that make
heavy use of characters which require several bytes to encode therein, or
relatively little use of characters in the ASCII range;

UTF-8 is fine for languages like German, Polish, Norwegian, Spanish, French, ...
where >= 90% of the text consists of 7-bit ASCII characters.

Add perl to that list, by the way. I rather strongly suspect that most perl code will consist mostly of 7-bit characters -- even perl code written by traditional-Chinese speakers. (I pick on traditional Chinese only because it has a very large character repertoire; that's one of the reasons there's a "simplified" variant.)
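As a quick illustration using perl5's Encode module (the strings here are just examples):

    use Encode qw(encode);

    my $ascii = 'my $x = 42;';          # 11 characters, all ASCII
    my $mixed = "Gr\x{FC}\x{DF}e";      # "Grüße", 5 characters, 2 non-ASCII

    print length(encode('UTF-8', $ascii)), "\n";   # 11 bytes: 1 byte/char
    print length(encode('UTF-8', $mixed)), "\n";   # 7 bytes: ü and ß cost 2 each

For code that is 90%+ ASCII, the UTF-8 overhead stays in the noise.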


but that's why
there are other encoding schemes like SCSU which get you Unicode
compatibility while not taking up much more space than the locale's native charset.

These make sense for languages like Japanese, Korean, Chinese, etc., where you need more than one byte per character anyway.


But Russian, Greek, Hebrew, Arabic, Armenian and Georgian would work fine with
one byte per character. But the kinds of encoding that I can think of both
make this two bytes per character. So for these I see file sizes doubled.
Or do I miss something?
Yes. You're missing the fact that SCSU is a very good encoding of Unicode. http://www.unicode.org/reports/tr6/#Examples


In general, SCSU is one byte per character, except that switching between half-blocks (that is, windows of 0x80 contiguous characters) takes one additional byte -- except for switching between a single half-block and ASCII, which is free. Thus, most of your second list of languages take one byte per character for most code, and two bytes for encoding « and ». Hebrew, Greek and Arabic take one additional byte (for the whole file) to encode which half-block the non-ASCII characters fall into. (Arabic and Cyrillic are among the default blocks.)
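To make that concrete, here is a hand-worked sketch of SCSU output for a Cyrillic string, based on my reading of UTR #6 (the SC2 tag byte 0x12 and the predefined window at U+0400 are taken from the report; double-check them there):

    # SCSU-encode "Привет" by hand. SC2 selects predefined dynamic
    # window 2 (U+0400, Cyrillic); after that, each character fits
    # in one byte: (code point - window base) + 0x80.
    my @scsu = (
        0x12,                                 # SC2: switch windows, once
        map { ord($_) - 0x0400 + 0x80 }       # 0x9F 0xC0 0xB8 0xB2 0xB5 0xC2
            split //, "\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}"
    );
    printf "%d bytes for 6 characters\n", scalar @scsu;   # 7

Seven bytes for six characters: one byte per character, plus a single window-switch byte for the whole run.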

The first list of languages is hard to predict -- it depends on how often you switch between the different Japanese alphabets (and pseudo-alphabet), for example. The report's example Japanese input compresses to about 1.5 bytes per character.

(Note that SCSU is really an encoding, whether it claims to be one or not.)

Anyway, it will be necessary to specify the encoding of Unicode in some way, which could possibly even allow specifying some non-Unicode charsets.
By the way, there is (should be) nothing that is encodable in a non-Unicode character set that is not encodable in (any encoding of) Unicode. That's where the "uni" bit comes from. If there is, it means that Unicode is not fulfilling its design goals.
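In perl5 terms (using the real Encode module; the byte values are just an arbitrary ISO-8859-7 example), that promise is what makes a round trip like this lossless:

    use Encode qw(decode encode);

    # Some Greek text as ISO-8859-7 bytes.
    my $legacy_bytes = "\xE3\xE5\xE9\xDC";

    # Into Unicode...
    my $text = decode('iso-8859-7', $legacy_bytes);

    # ...out through a Unicode encoding, and back again.
    my $roundtrip = encode('iso-8859-7',
                           decode('UTF-8', encode('UTF-8', $text)));

    print $roundtrip eq $legacy_bytes ? "lossless\n" : "broken promise\n";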

IMHO the OS should provide a standard way to specify such a charset as a file
attribute, but usually it does not and it won't in the future, unless the file
comes through the network and has a MIME header.

I think the answer is multi-fold.


0) Auto-detect the encoding in the compiler, if a U+FEFF signature (a byte-order mark), or a #! signature, is found at the beginning of the input. (If there is a U+FEFF signature, it should get thrown away after it is recognized. It may be possible to recognize on "package" or "module" as well, and possibly even on "#".)
1) Believe what the underlying FS/OS/transport tells us. (This is likely to be a constant for many OSes, possibly selectable at the compiler's compile time. It's the charset parameter on the end of the Content-Type for HTTP and other MIME-based transports.)
2) Support a "use encoding 'foo'" pragma similar to that in recent perl5s: it states the encoding that the file it appears in is written in.


(The higher-numbered sources of encoding information override the lower-numbered ones; a sketch of that precedence follows.)
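Something like this, roughly (hypothetical compiler guts -- only the signature bytes and the perl5 "use encoding" pragma are real; the names and the default are assumptions):

    # Sketch: pick an input encoding, lowest-priority source first.
    sub guess_encoding {
        my ($first_bytes, $transport_hint) = @_;
        my $enc = 'UTF-8';   # assumed compile-time default

        # 0) Signature sniffing at the start of the input.
        if    ($first_bytes =~ /^\xEF\xBB\xBF/) { $enc = 'UTF-8'    }  # U+FEFF as UTF-8
        elsif ($first_bytes =~ /^\xFE\xFF/)     { $enc = 'UTF-16BE' }  # U+FEFF, big-endian
        elsif ($first_bytes =~ /^\xFF\xFE/)     { $enc = 'UTF-16LE' }  # U+FEFF, little-endian
        elsif ($first_bytes =~ /^#!/)           { $enc = 'ASCII'    }  # shebang line

        # 1) The transport's claim (e.g. the charset parameter of an
        #    HTTP Content-Type header) overrides the sniffed signature.
        $enc = $transport_hint if defined $transport_hint;

        # 2) A "use encoding '...'" line in the file itself would
        #    override both, once the parser reaches it.
        return $enc;
    }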
