Kaixo!

On Wed, Jan 09, 2002 at 05:20:11AM -0500, Glenn Maynard wrote:
> On Wed, Jan 09, 2002 at 03:18:36AM -0600, [EMAIL PROTECTED] wrote:
> > [The German Fraktur font of Latin being unreadable to English and many 
> > modern German readers.]
> 
> If modern German used this font, and I wanted to mix German and English,
> then I'd definitely want a way to make sure the right font could be used
> for both, if that's what the user wanted.

Even in a text-only, monofont appliance like the display of a VCR
controller, or a GSM phone display? Even on a road sign? Even when you
handwrite the text?

There are cases where the concept of using different fonts for different
portions of text (depending on language or any other criterion) applies,
and other cases where it doesn't apply at all.

> The important goal is to make sure users in other languages aren't so
> annoyed with some lack within the spec that they go miles out of it to
> get what they want.

If they can choose the font they want for display, why would they be annoyed
at all?

People who say they are annoyed are in fact annoyed by the Unicode
unification itself more than by anything else; and even if that
unification never has any visible consequence in their lives, they will
still be annoyed. If they had never heard of the Unicode unification,
they would never have noticed it.

> I'm still not certain of the exact cause of, for example, EUC-JP and
> Shift-JIS ending up in ID3V2 tags.

EUC-JP and Shift-JIS can encode *only* Japanese; so what is the
difference between encoding Japanese text in a Japanese-only encoding
and using a Japanese-only font, and encoding Japanese text in Unicode
and using a Japanese font? There is absolutely no visible difference.

That is why that proposal is nonsense.

It would be different if the proposal were for another multilingual
encoding similar to, but incompatible with, Unicode (incompatible in
that the character set is different, not merely the ordering).
But even then, it would make more sense to add the missing characters
to Unicode than to wrestle with every application on earth to make it
support a non-standard encoding.

> I'm getting the impression that
> Japanese programmers who wanted Japanese-capable editors didn't use the
> library, and ignored the UTF-8 spec more for political reasons ("I don't
> like Unicode") than practical ones.  (With Japanese encodings, you still
> can't embed some characters in a Chinese font.) Another possibility is
> that they did use the library, but the library didn't perform the
> appropriate conversions (and they couldn't be bothered to fix it to use
> an encoding they didn't like to begin with.)

The problem is that proper Unicode support needs much more than simple
Japanese support. You need to handle a complex multi-byte encoding,
with multi-width characters as well (while Japanese-only encodings are
quite simple: only two kinds of characters, ASCII (1 byte, 1 column)
and Japanese (2 bytes, 2 columns)). There are no non-spacing
characters, no combining characters, no characters encoded in 3, 4, 5
or 6 bytes... On top of that, libraries to convert between the various
Japanese encodings have been around for years; they are mature, there
is a lot of sample code (including real applications) using them, and
a lot of programmers are experienced with them.
UTF-8 is a whole new world, and it needs some time to mature to the
same level, and to be understood and used by programmers.
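
To make that concrete, here is a minimal POSIX C sketch (the locale
name and the sample string are only illustrative assumptions): with a
multi-byte encoding like UTF-8, a program must ask for every single
character how many bytes *and* how many columns it takes, instead of
assuming the two fixed cases of the Japanese-only encodings:

/* A minimal sketch, assuming a POSIX system with a UTF-8 locale
 * installed; the locale name and sample string are illustrative only. */
#define _XOPEN_SOURCE 600
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_CTYPE, "en_US.UTF-8");      /* assumed to be available */

    const char *s = "a\xc3\xa9\xe6\x97\xa5"; /* "a", e-acute, kanji "day" in UTF-8 */
    mbstate_t st;
    memset(&st, 0, sizeof st);

    const char *p = s;
    size_t left = strlen(s);
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, p, left, &st); /* byte length of this char */
        if (n == (size_t)-1 || n == (size_t)-2)
            break;                             /* invalid or truncated input */
        printf("U+%04lX: %lu byte(s), %d column(s)\n",
               (unsigned long)wc, (unsigned long)n, wcwidth(wc));
        p += n;
        left -= n;
    }
    return 0;
}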

That isn't limited to Japan either; encodings like iso-8859-*, koi8-*,
etc. are still widely used, and still the preferred encodings for a lot
of people.

All that will change of course; but it is an evolution, not a revolution.
It needs time.

On the other hand, for a completely new development it makes sense to
use Unicode (UTF-8 or another encoding) internally, and to use a good
iconv-like library to convert to the locale encoding if needed.
So, as the Ogg format is quite new, it makes sense to mandate UTF-8 as
the default and *only* encoding used for all embedded text.
That also has the extra advantage of avoiding all the problems caused
by misinterpreting the encoding. No mojibake.

> Either way, having a stable library that does the appropriate
> conversions will probably go a long way to keeping that from happening
> again with Ogg tags.

At least on Microsoft Windows systems and on systems using GNU libc,
such a library for converting between encodings actually exists.
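
With GNU libc that is the iconv(3) interface. A minimal sketch of how
an application could convert a UTF-8 tag string to the user's locale
encoding (the buffer size, error handling and sample string are
simplified assumptions):

/* Convert a UTF-8 string to the current locale encoding with iconv(3). */
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    setlocale(LC_ALL, "");                      /* pick up the user's locale */
    const char *codeset = nl_langinfo(CODESET); /* e.g. "EUC-JP", "ISO-8859-1" */

    iconv_t cd = iconv_open(codeset, "UTF-8");  /* to locale, from UTF-8 */
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }

    char in[] = "Caf\xc3\xa9";                  /* "Cafe" with e-acute, in UTF-8 */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
        perror("iconv");                        /* e.g. char not in locale charset */
    else
        printf("%.*s\n", (int)(sizeof out - outleft), out);

    iconv_close(cd);
    return 0;
}

On systems whose libc lacks iconv(), the portable GNU libiconv library
provides the same interface.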

-- 
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/            PGP Key available, key ID: 0x8F0E4975
