Joerg Schilling wrote:
> Don Cragun <don.cragun at sun.com> wrote:
> > >BTW: Regarding our talk... I checked the POSIX standard and it turns out
> > >that od(1) support for UTF-8 "chars" is fully optional. There is no need to
> > >support it.
> > >
> > >Jörg
> >
> > Joerg,
> > This is only partly true.
>
> Please also comment Rolands claim that UNICODE is not a lossless coding.
> Roland mentioned this recently without giving evidence.

There wasn't enough time during our meeting to show the problem in
detail...

> I can hardly believe that the 21 bit coding used by UNICODE still has problems
> to map other codings. UNICODE has been designed to be a lossless coding....

... I try to keep it short:
Some encodings (e.g. ISO-2022) can define the language being used in the
following characters (similar to the xml:lang="<lang>" attribute in XML).
Since Unicode folds some characters which are shared between languages
into one codepoint (search for "han-unification") this information is
lost[1], making Unicode not 100% lossless.

Sounds trivial but it results in some unhappy&&nasty issues when users
mix text from multiple languages. One of the "harmless" things is that
browsers choose fonts based on the language being used, which may lead
to issues like a Japanese font being used for a single lonely character
in the middle of an otherwise completely Chinese text... and vice versa
(and if you've followed the history of both countries in the last
>= 1500 years you may realise that they don't like that much...).
Unfortunately this hits languages where the matching countries are
hyper-picky about their characters (note: that's an understatement).

[1]=Technically there are language-selector characters in a block
outside the BMP (= Basic Multilingual Plane) but I'm not sure whether
they are really intended for this use - at least the existing converters
do not use them and I can't find a standard (or even draft) which
defines their usage.

Or in short: The situation is stuck badly in the mud. If you want the
long story ask in i18n-discuss@, AFAIK Ienup can explain all the details
better than I can...

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) roland.mainz at nrubsig.org
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 7950090
 (;O/ \/ \O;)
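
A minimal sketch of the han-unification point, in Python. This is not
part of the original mail: the sample character U+76F4 and the codec
names "iso2022_jp" and "gb2312" are assumptions chosen only for
illustration. The same ideograph arrives once from a Japanese and once
from a Chinese encoding; after decoding, both are the identical code
point, so the Unicode string alone no longer says which source it came
from. The last lines look up one of the language-selector characters
mentioned in [1].

```python
import unicodedata

# U+76F4 is a unified CJK ideograph that exists in both JIS X 0208
# (Japanese) and GB2312 (Chinese). Picked here purely as an example.
ch = "\u76f4"

# Encode the character as it might appear in a Japanese and in a
# Chinese legacy byte stream.
jp_bytes = ch.encode("iso2022_jp")  # ISO-2022-JP, includes escape sequences
cn_bytes = ch.encode("gb2312")      # GB2312 in its EUC-CN byte form

# Decode both streams to Unicode.
jp_text = jp_bytes.decode("iso2022_jp")
cn_text = cn_bytes.decode("gb2312")

# The byte streams differ, but the decoded strings are indistinguishable:
# the hint about the source charset/language is gone.
different_bytes = jp_bytes != cn_bytes
same_codepoint = jp_text == cn_text

# One of the "language-selector" characters from footnote [1]:
# U+E0001 lives in plane 14, i.e. outside the BMP.
lang_tag = "\U000E0001"
tag_name = unicodedata.name(lang_tag)
outside_bmp = ord(lang_tag) > 0xFFFF

print("bytes differ:        ", different_bytes)
print("same code point:     ", same_codepoint, hex(ord(jp_text)))
print("tag character:       ", tag_name)
print("tag outside the BMP: ", outside_bmp)
```

Note that each conversion taken alone still round-trips byte for byte;
what gets lost is only the distinction between the two sources once
their text has been mixed in one Unicode string.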
