>Date: Tue, 13 May 2008 16:54:16 +0200
>From: Roland Mainz <roland.mainz at nrubsig.org>
>
>Joerg Schilling wrote:
>> Don Cragun <don.cragun at sun.com> wrote:
>> > >BTW: Regarding our talk... I checked the POSIX standard and it turns out
>> > >that od(1) support for UTF-8 "chars" is fully optional. There is no need to
>> > >support it.
>> >
>> > >Jörg
>> >
>> > Joerg,
>> > This is only partly true.
>>
>> Please also comment on Roland's claim that UNICODE is not a lossless coding.
>> Roland mentioned this recently without giving evidence.
Joerg,
In addition to the comments Roland made below, there are also a
lot of "private" character sets that contain characters (e.g., the AT&T
deathstar logo, the Sun logo, etc.) that do not appear in any ISO
standard character set. Also, just as new English words are created
every year, new ideographs appear in the languages that use ideographic
character sets. These ideographs may be used for a long time before
they are included in a UNICODE revision (and when the new ideographs
represent children's names, they may never be included).
- Don
>
>There wasn't enough time during our meeting to show the problem in
>detail...
>
>> I can hardly believe that the 21 bit coding used by UNICODE still has problems
>> mapping other codings. UNICODE has been designed to be a lossless coding....
>
>... I'll try to keep it short: Some encodings (e.g. ISO-2022) can define
>the language being used in the following characters (similar to the
>xml:lang="<lang>" attribute in XML). Since Unicode folds some characters which
>are shared between languages to one codepoint (search for
>"han-unification") this information is lost[1], making Unicode not 100%
>lossless. Sounds trivial, but it results in some unhappy and nasty issues
>when users mix text from multiple languages. One of the "harmless"
>ones is that browsers choose fonts based on the language being
>used, which may lead to a Japanese font being used for a
>single lonely character in the middle of an otherwise completely Chinese
>text... and vice versa... (and if you've followed the history of both
>countries in the last >= 1500 years you may realise that they don't like
>that much...). This is unfortunate for languages where the matching countries
>are hyper-picky about their characters (note: That's an understatement).
>
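The information loss described above can be sketched in a few lines of Python. This is only an illustration under assumptions not in the original mail: the ideograph U+76F4 is picked as an example, and Python's `iso2022_jp` and `gb2312` codecs stand in for "a Japanese" and "a Chinese" source encoding. The ISO-2022-JP byte stream carries an escape sequence that designates the character set (and so implies the language); after conversion to Unicode both inputs are the same code point and that hint is gone.

```python
# Assumption: U+76F4 is used purely as an example of a unified Han ideograph.
IDEOGRAPH = "\u76f4"

# Encode the same character in a Japanese and a Chinese legacy encoding.
jp_bytes = IDEOGRAPH.encode("iso2022_jp")  # includes an ESC sequence designating the charset
cn_bytes = IDEOGRAPH.encode("gb2312")

# Convert both byte streams to Unicode.
jp_text = jp_bytes.decode("iso2022_jp")
cn_text = cn_bytes.decode("gb2312")

print("Japanese source bytes:", jp_bytes)
print("Chinese source bytes: ", cn_bytes)
print("Same after conversion:", jp_text == cn_text, f"(U+{ord(jp_text):04X})")
```

Once both strings are equal there is nothing left in the text itself that tells a renderer which source, and therefore which font, was intended; that has to come from out-of-band information such as markup.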
>[1]=Technically there are language-selector characters in a block
>outside the BMP (= Basic Multilingual Plane) but I'm not sure whether
>they are really intended for this use - at least the existing converters
>do not use them and I can't find a standard (or even draft) which
>defines their usage. In short: The situation is stuck badly in the mud.
>
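For reference, the language-selector characters mentioned in the footnote live in Plane 14: U+E0001 LANGUAGE TAG followed by "tag" copies of ASCII characters at U+E0020 to U+E007F. The sketch below only shows what such a sequence looks like; `language_tag` is a hypothetical helper, not something an existing converter provides, and the Unicode Standard itself discourages using these characters for language tagging.

```python
import unicodedata

LANGUAGE_TAG = "\U000E0001"  # U+E0001 LANGUAGE TAG


def language_tag(lang: str) -> str:
    """Hypothetical helper: spell an ASCII language code with Plane 14 tag characters."""
    if not lang.isascii():
        raise ValueError("language code must be ASCII")
    # Each tag character is the ASCII code point shifted up by 0xE0000.
    return LANGUAGE_TAG + "".join(chr(0xE0000 + ord(c)) for c in lang)


# Prefix an ideograph with an (invisible) "ja" language tag.
tagged = language_tag("ja") + "\u76f4"
for ch in tagged:
    print(f"U+{ord(ch):05X}", unicodedata.name(ch))
```

Every tag character is above U+FFFF, so software limited to the BMP cannot even represent the tag, which fits the observation that converters ignore it.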
>If you want the long story, ask in i18n-discuss@; AFAIK Ienup can explain
>all the details better than I can...
>
>----
>
>Bye,
>Roland