UCM file and combining character sequences

Hank TT Sat, 20 Sep 2003 20:50:05 -0700

Hi,

I'm trying to make a UCM file to feed to enc2xs.  The legacy encoding for
Taiwanese romanization *must* have its code points mapped to Unicode
character sequences, for the simple reason that the UCS lacks the
corresponding precomposed characters (and is unlikely to have them in the
future, as they are composable from existing characters in the Latin
script blocks and the Combining Diacritical Marks block).  (See [1] for script
details.)
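
To make the requirement concrete, here is a minimal sketch in Python (not
Perl, and not what enc2xs generates) of a legacy code point that maps to a
Unicode sequence rather than to a single code point.  The byte values 0x80
and 0x81 are invented for illustration and are not taken from the actual
legacy encoding; the function names are hypothetical as well.

```python
# Hypothetical table: the byte values are made up for illustration only.
# Each legacy byte maps to a base letter followed by a combining mark,
# because no precomposed character exists for the pair.
LEGACY_TO_UNICODE = {
    0x80: "a\u030d",  # a + U+030D COMBINING VERTICAL LINE ABOVE
    0x81: "o\u030d",  # o + U+030D COMBINING VERTICAL LINE ABOVE
}

# Reverse table, keyed by the whole Unicode sequence.
UNICODE_TO_LEGACY = {seq: byte for byte, seq in LEGACY_TO_UNICODE.items()}
MAX_SEQ = max(len(seq) for seq in UNICODE_TO_LEGACY)


def decode_legacy(data: bytes) -> str:
    """Decode legacy bytes; one byte may expand to several code points."""
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))  # assume the low half is plain ASCII
        elif b in LEGACY_TO_UNICODE:
            out.append(LEGACY_TO_UNICODE[b])
        else:
            raise ValueError(f"unmapped legacy byte 0x{b:02X}")
    return "".join(out)


def encode_legacy(text: str) -> bytes:
    """Encode text, trying the longest known Unicode sequence first."""
    out = bytearray()
    i = 0
    while i < len(text):
        for n in range(MAX_SEQ, 0, -1):
            chunk = text[i:i + n]
            if chunk in UNICODE_TO_LEGACY:
                out.append(UNICODE_TO_LEGACY[chunk])
                i += n
                break
        else:
            # No sequence matched at this position: fall back to ASCII.
            if ord(text[i]) >= 0x80:
                raise ValueError(f"unmapped character U+{ord(text[i]):04X}")
            out.append(ord(text[i]))
            i += 1
    return bytes(out)
```

The encoder has to look ahead past the base letter before deciding what to
emit, which is exactly what a strictly one-code-point-per-line mapping table
cannot express.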


Now, IBM's ICU pages document the mapping of one Unicode code point to one
legacy code point, as well as one-to-many, but apparently not many-to-one or
many-to-many:

" In the CHARMAP section of a .ucm file, each line contains a Unicode code
point (like <U{1-6 hexadecimal digits for the code point}> ), a codepage
character byte sequence (each byte like \x{2 hexadecimal digits} ).... " [2]
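
Read literally, the quoted format allows exactly one Unicode code point per
line.  The following Python sketch parses a line under that reading only; it
is an assumption about the format as quoted above, not a description of what
enc2xs or ICU actually accept, and the function name and sample lines are
made up for illustration.

```python
import re

# One <U....> token (1-6 hex digits), whitespace, then one or more \xNN bytes.
# Anything after the byte sequence is ignored by this sketch.
LINE_RE = re.compile(r"^<U([0-9A-Fa-f]{1,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)")
BYTE_RE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def parse_charmap_line(line: str):
    """Return (code_point, legacy_bytes), or None if the line does not fit."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    code_point = int(match.group(1), 16)
    legacy = bytes(int(h, 16) for h in BYTE_RE.findall(match.group(2)))
    return code_point, legacy


# A single code point mapped to a single byte parses fine.
print(parse_charmap_line(r"<U0041> \x41"))
# Two code points for one legacy byte do not fit this reading of the format.
print(parse_charmap_line(r"<U0061><U030D> \x80"))
```

Under this reading, the second sample line is rejected, which is the case I
need to handle.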

How does enc2xs deal with (or intend to deal with) such a case?  Is the ICU
specification to be followed rigidly?

Since I am very new to Perl, any insight is appreciated.

[1] http://lomaji.com/poj/chart.html
[2] http://oss.software.ibm.com/icu/userguide/conversion-data.html

--Henry H. Tan-Tenn
