It is definitely a problem to try to interpret what any given label is supposed to mean. MIME labels and others are ambiguous, and are interpreted in different ways on different systems.
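As a small illustration of one label family covering more than one table, here is a sketch using Python's bundled codecs. They are only stand-ins I am assuming for the platform tables under discussion; this is not ICU and not any platform's actual converter. The byte pair 0x81 0x60 is valid in both tables but comes out as a different Unicode character in each.

```python
# The same two bytes, decoded under two tables that are both commonly
# reached through the label "Shift_JIS" / "SJIS".
# Assumption: Python's codecs stand in for the platform tables here.
data = b"\x81\x60"

# "shift_jis" is Python's JIS X 0208-based table;
# "cp932" is the Microsoft variant (Windows code page 932).
decodings = {name: data.decode(name) for name in ("shift_jis", "cp932")}

for name, text in decodings.items():
    # Print the code point rather than the glyph, so the difference is
    # visible even without a suitable font installed.
    print(f"{name:>9}: U+{ord(text):04X}")
```

The two code points differ (WAVE DASH versus FULLWIDTH TILDE), so a round trip through the "wrong" table does not return the original text.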
MIME/IANA is the best registry we have, but there are a number of significant problems:

- Because for most charsets the registry publishes no mapping to and from Unicode/10646, it is not clear, and certainly not easy, to figure out exactly what the "unambiguous decoding" is.
- In practice, the industry does NOT interpret the same bytes the same way; for example, you will get different decodings from "SJIS" on different platforms.

One of the current projects under development for an upcoming release of ICU is to have a more precise API, where you can pass in a label AND a platform (AND version), and get what the platform interprets that label to mean. That way you can ask for "EUC-JP" as interpreted on, say, Solaris.

Mark
—————
Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com

----- Original Message -----
From: "Nick Ing-Simmons" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>; "SADAHIRO Tomoyuki" <[EMAIL PROTECTED]>
Sent: Friday, February 01, 2002 10:21
Subject: Re: ICU's uconv vs Linux iconv and UTF-8

> Mark Davis <[EMAIL PROTECTED]> writes:
> >>ICU's pedantic form
> >
> >The goal for ICU is to be charset neutral, and support all of the
> >conversions that are in modern use. There are a large number of
> >variants of character sets;
>
> Fair enough - but as shipped (I downloaded it earlier this week)
> it comes with a convrtrs.txt which maps MIME's EUC-JP onto
> something it calls ibm-33722 which has the behaviour I reported at
> the start of this thread.
>
> >you can use the one you want.
>
> It is not a question of which _I_ want - it is a question of which one(s)
> CJK perl users want/expect/need.
>
> In so far as _I_ want any particular one it is the one which is going
> to match the X11 font encoding so I can in my naive westerner's way
> see what it looks like - and I have not a clue which one that is ...
>
> >See:
> >
> >http://oss.software.ibm.com/icu/charset/index.html
>
> A huge list and I don't see how to "grep" it for the provenance of
> the table (not that many seem to have any).
>
> So can the experts - ideally native reading experts not theorists - tell
> me which ICU (or other open source) table(s) they want/expect/need,
> or failing that which ones have proven troublesome.
>
> There seem to be at least 4 EUC-JP mappings in that list:
> AIX, Solaris, glibc and Java.
>
> If we cannot get any answers "quickly" then I think Dan is correct -
> we should un-bundle the whole CJK encoding stuff from the "core" into
> a family of CPAN modules.
>
> Which gives me a design choice:
>
> A. Bundle a "pragmatic" set of CJK which are fast and cause least build
>    pain for non CJK users (i.e. compact precompiled form)
>
> B. Make it as easy as possible for end-user to drop in a new encoding
>    from (say) a .ucm file.
>
> I can obviously try for both - but they seem to be pulling in opposite
> directions at present.
>
> Meanwhile I will go fix the bugs in the core's :encoding logic ...
>
> --
> Nick Ing-Simmons
> http://www.ni-s.u-net.com/