Hi,

I'm trying to make a UCM file to feed to enc2xs. The legacy encoding for Taiwanese romanization *must* have its code points mapped to sequences of Unicode characters, for the simple reason that the UCS lacks the corresponding precomposed characters (and is unlikely to add them in the future, since they can be composed from existing characters in the Latin blocks and the Combining Diacritical Marks block). (See [1] for script details.)
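To make the requirement concrete, here is a minimal sketch in Python (not Perl, and not how enc2xs works) of a codec in which one legacy byte corresponds to a sequence of Unicode code points. The byte values 0x80 and 0x81 and the table name LEGACY_TO_UNICODE are invented for illustration and are not taken from the actual legacy encoding; the only thing assumed about the script is that it uses a base letter followed by U+030D COMBINING VERTICAL LINE ABOVE.

```python
import unicodedata

# Hypothetical legacy table: one byte -> a sequence of Unicode code points.
# The byte values are invented for illustration only.
LEGACY_TO_UNICODE = {
    0x80: "a\u030d",  # a + COMBINING VERTICAL LINE ABOVE
    0x81: "o\u030d",  # o + COMBINING VERTICAL LINE ABOVE
}


def decode_legacy(data: bytes) -> str:
    """Pass ASCII through; expand each mapped byte to its sequence."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            out.append(LEGACY_TO_UNICODE[b])  # KeyError if unmapped
    return "".join(out)


def encode_legacy(text: str) -> bytes:
    """Encode by longest match, so base + mark collapses to one byte."""
    reverse = {seq: byte for byte, seq in LEGACY_TO_UNICODE.items()}
    longest = max(len(seq) for seq in reverse)
    out = bytearray()
    i = 0
    while i < len(text):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(longest, 0, -1):
            chunk = text[i:i + n]
            if chunk in reverse:
                out.append(reverse[chunk])
                i += n
                break
        else:
            # No sequence matched here: only plain ASCII is allowed.
            ch = text[i]
            if ord(ch) >= 0x80:
                raise ValueError(f"unmappable character U+{ord(ch):04X}")
            out.append(ord(ch))
            i += 1
    return bytes(out)


if __name__ == "__main__":
    text = decode_legacy(b"t\x80k")
    # NFC leaves the sequence alone because no precomposed form exists.
    print(len(text), unicodedata.normalize("NFC", text) == text)
    print(encode_legacy(text))
```

The point of the sketch is only that the encoder has to match more than one code point at a time, which is exactly the case a strict one-code-point-per-line table cannot express.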
Now, IBM's ICU pages document the mapping of one Unicode code point to one legacy code point, as well as one-to-many, but apparently not many-to-one or many-to-many:

"In the CHARMAP section of a .ucm file, each line contains a Unicode code point (like <U{1-6 hexadecimal digits for the code point}> ), a codepage character byte sequence (each byte like \x{2 hexadecimal digits} )...." [2]

How does enc2xs deal with (or intend to deal with) such a case? Is the ICU specification to be followed rigidly?

Since I am very new to Perl, any insight is appreciated.

[1] http://lomaji.com/poj/chart.html
[2] http://oss.software.ibm.com/icu/userguide/conversion-data.html

--Henry H. Tan-Tenn