Re: ICU's uconv vs Linux iconv and UTF-8

Mark Davis (jtcsv) Fri, 01 Feb 2002 15:50:10 -0800

It is definitely a problem to try to interpret what any given label is
supposed to be. The problem is that MIME labels and others are
ambiguous, and are interpreted in different ways on different systems.


MIME/IANA is the best registry we have, but there are a number of
significant problems:

- because for most labels there is no published mapping in the
  registry to and from Unicode/10646, it is not clear, and certainly
  not easy, to figure out exactly what the "unambiguous decoding" is.

- in practice, the industry does NOT interpret the same bytes the same
  way; for example, you will get different decodings from "SJIS" on
  different platforms.
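
The second point can be made concrete with a small sketch. It uses
Python's bundled codecs, which are an assumption of this example and
are not mentioned in the thread: "shift_jis" and "cp932" are two tables
that both get called "SJIS" in practice. The helper name
decode_variants is hypothetical.

```python
def decode_variants(data, codecs):
    """Decode the same bytes with several codecs.

    Returns a dict mapping codec name -> list of code points as strings.
    """
    results = {}
    for name in codecs:
        text = data.decode(name)
        results[name] = [f"U+{ord(ch):04X}" for ch in text]
    return results


# 0x81 0x60 is a double-byte sequence whose Unicode mapping is known to
# differ between the JIS-derived and the Microsoft-derived tables.
sample = b"\x81\x60"
variants = decode_variants(sample, ["shift_jis", "cp932"])
for name, points in variants.items():
    print(name, points)
```

Running this prints one code point per codec; the two should not
agree, which is the ambiguity described above.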

One of the current projects under development for an upcoming release
of ICU is to have a more precise API, where you can pass in a label
AND a platform (AND version), and get what the platform interprets
that label to mean. That way you can ask for "EUC-JP" as interpreted
on, say, Solaris.
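
To make the idea concrete, here is a minimal sketch of a lookup keyed
by label and platform. It is not ICU's API: the function
resolve_converter, both dictionaries, and every table name except
ibm-33722 (the table the quoted message says convrtrs.txt maps EUC-JP
to) are hypothetical placeholders, and the version argument is left
out for brevity.

```python
# Hypothetical registry: (label, platform) -> converter table name.
# Only "ibm-33722" comes from this thread; the rest are placeholders.
PLATFORM_TABLES = {
    ("euc-jp", "aix"): "hypothetical-aix-eucjp",
    ("euc-jp", "solaris"): "hypothetical-solaris-eucjp",
    ("euc-jp", "glibc"): "hypothetical-glibc-eucjp",
    ("euc-jp", "java"): "hypothetical-java-eucjp",
}

# Table used when no platform is given or the platform is unknown.
DEFAULT_TABLES = {"euc-jp": "ibm-33722"}


def resolve_converter(label, platform=None):
    """Return the table name for a label as interpreted on a platform."""
    key = label.strip().lower()
    if platform is not None:
        specific = PLATFORM_TABLES.get((key, platform.strip().lower()))
        if specific is not None:
            return specific
    try:
        return DEFAULT_TABLES[key]
    except KeyError:
        raise LookupError(f"no converter known for label {label!r}") from None


print(resolve_converter("EUC-JP"))
print(resolve_converter("EUC-JP", platform="Solaris"))
```

Falling back to a default table keeps plain label lookups working
while letting a caller ask for one platform's reading explicitly.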

Mark
—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο 
πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Nick Ing-Simmons" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>;
<[EMAIL PROTECTED]>; "SADAHIRO Tomoyuki" <[EMAIL PROTECTED]>
Sent: Friday, February 01, 2002 10:21
Subject: Re: ICU's uconv vs Linux iconv and UTF-8


> Mark Davis <[EMAIL PROTECTED]> writes:
> >>ICU's pedantic form
> >
> >The goal for ICU is to be charset neutral, and support all of the
> >conversions that are in modern use. There are a large number of
> >variants of character sets;
>
>
> Fair enough - but as shipped (I downloaded it earlier this week)
> it comes with a convrtrs.txt which maps MIME's EUC-JP onto
> something it calls ibm-33722 which has the behaviour I reported at
> the start of this thread.
>
> >you can use the one you want.
>
> It is not a question of which _I_ want - it is a question of which
> one(s) CJK perl users want/expect/need.
>
> In so far as _I_ want any particular one it is the one which is going
> to match the X11 font encoding so I can in my naive westerner's way
> see what it looks like - and I have not a clue which one that is ...
>
> >See:
> >
> >http://oss.software.ibm.com/icu/charset/index.html
>
> A huge list and I don't see how to "grep" it for the provenance of
> the table (not that many seem to have any).
>
> So can the experts - ideally native reading experts not theorists -
> tell me which ICU (or other open source) table(s) they
> want/expect/need, or failing that which ones have proven troublesome.
>
> There seem to be at least 4 EUC-JP mappings in that list:
> AIX, Solaris, glibc and Java
>
> If we cannot get any answers "quickly" then I think Dan is correct -
> we should un-bundle the whole CJK encoding stuff from the "core"
> into a family of CPAN modules.
>
> Which gives me a design choice:
>
> A. Bundle a "pragmatic" set of CJK which is fast and causes least
>    build pain for non CJK users (i.e. compact precompiled form)
>
> B. Make it as easy as possible for an end-user to drop in a new
>    encoding from (say) a .ucm file.
>
> I can obviously try for both - but they seem to be pulling in
> opposite directions at present.
>
> Meanwhile I will go fix the bugs in the core's :encoding logic ...
>
> --
> Nick Ing-Simmons
> http://www.ni-s.u-net.com/
>
>
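
For option B in the quoted message, a rough sketch of reading the
mapping section of a .ucm-style file into a decode table might look
like the following. It is written in Python only for illustration (the
thread concerns Perl's core). The sample text, the names parse_ucm and
SAMPLE_UCM, and the choice to keep only round-trip entries with a
single code point are assumptions of this sketch, not how any shipped
tool works.

```python
import re

# Illustrative fragment in .ucm style; not copied from any shipped table.
SAMPLE_UCM = r"""
<code_set_name> "example-eucjp"
<mb_cur_max> 3
CHARMAP
<U0041> \x41 |0
<U3042> \xA4\xA2 |0
END CHARMAP
"""

# One mapping line: <UXXXX>, one or more \xHH bytes, optional |flag.
MAPPING = re.compile(
    r"^<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s*(?:\|(\d))?"
)


def parse_ucm(text):
    """Return a dict mapping encoded bytes -> one-character string."""
    table = {}
    in_charmap = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        if line == "CHARMAP":
            in_charmap = True
            continue
        if line == "END CHARMAP":
            in_charmap = False
            continue
        if not in_charmap:
            continue  # header lines such as <mb_cur_max> are ignored here
        match = MAPPING.match(line)
        if match is None:
            continue  # skip anything this sketch does not understand
        code_point, byte_text, flag = match.groups()
        # Assumption: a missing flag is treated like |0 (round-trip).
        if flag not in (None, "0"):
            continue
        hex_pairs = re.findall(r"\\x([0-9A-Fa-f]{2})", byte_text)
        encoded = bytes(int(pair, 16) for pair in hex_pairs)
        table[encoded] = chr(int(code_point, 16))
    return table


table = parse_ucm(SAMPLE_UCM)
for encoded, char in table.items():
    print(encoded.hex(), f"U+{ord(char):04X}")
```

A table built this way is easy to drop in but slow to load, which is
the tension with the compact precompiled form of option A.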
