Hank Tt <[EMAIL PROTECTED]> writes:
>Hi,
>
>I'm trying to make a UCM file to feed to enc2xs. The legacy encoding for
>Taiwanese romanization *must* have its code points mapped to Unicode
>character sequences, for the simple reason that the UCS lacks the
>corresponding precomposed characters (and is unlikely to have them in the
>future, as they are composable using existing characters from the Latin
>script and the Diacritical Combining Marks blocks). (See [1] for script
>details.)
>
>Now, IBM's ICU pages document the mapping of one Unicode to one legacy
>codepoint as well as one-to-many but, apparently, not many-to-one or
>many-to-many:
>
>" In the CHARMAP section of a .ucm file, each line contains a Unicode code
>point (like <U{1-6 hexadecimal digits for the code point}> ), a codepage
>character byte sequence (each byte like \x{2 hexadecimal digits} ).... " [2]
>
>How does enc2xs deal with (or intend to deal with) such a case?

It may not in its current form. The underlying C code engine is an
octet-sequence->octet-sequence converter, so provided the source encoding
is unambiguous (without lookahead), it can be represented. Whether ucm can
handle it is less clear, but I don't see why not: it too has two chunks of
octets per line. What may need some work is the table building, so that the
reverse mapping (base char + mark) returns one encoded thing.

>Is the ICU
>specification to be followed rigidly?

No, we follow it pragmatically - but we may not yet be handling all that
ICU can express.

>
>Since I am very new to Perl, any insight is appreciated.
>
>[1] http://lomaji.com/poj/chart.html
>[2] http://oss.software.ibm.com/icu/userguide/conversion-data.html
>
>--Henry H. Tan-Tenn
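
To make the "octet-sequence->octet-sequence" idea concrete, here is a rough sketch in Python of a table-driven converter. It is not the enc2xs engine and not its table format. The legacy byte values 0x80 and 0x81, the names build_trie and convert, and the longest-match rule are all assumptions made for illustration. The point it tries to show is the one about reverse mapping: going from the legacy encoding to UTF-8 each byte stands alone, but going back, "o" is a prefix of "o + combining mark", so the table has to let the longer sequence win.

```python
# Hypothetical sketch of an octet-sequence -> octet-sequence converter.
# The legacy byte assignments are invented; this is not enc2xs's engine.

def build_trie(pairs):
    """Build a byte trie from (source_bytes, target_bytes) pairs."""
    root = {}
    for src, dst in pairs:
        node = root
        for b in src:
            node = node.setdefault(b, {})
        node[None] = dst  # None marks "a complete source sequence ends here"
    return root

def convert(data, trie):
    """Convert data by repeatedly taking the longest matching source sequence."""
    out = bytearray()
    i = 0
    while i < len(data):
        node, j = trie, i
        best = None  # (end_index, replacement) of the longest match so far
        while j < len(data) and data[j] in node:
            node = node[data[j]]
            j += 1
            if None in node:
                best = (j, node[None])
        if best is None:
            raise ValueError(f"no mapping for byte 0x{data[i]:02x} at offset {i}")
        i = best[0]
        out += best[1]
    return bytes(out)

# Made-up legacy table: ASCII maps to itself, 0x80 stands for
# o + U+0358 (combining dot above right), and 0x81 for the same
# sequence followed by U+0301 (combining acute accent).
ASCII = [(bytes([c]), bytes([c])) for c in range(0x80)]
LEGACY_TO_UTF8 = ASCII + [
    (b"\x80", "o\u0358".encode("utf-8")),
    (b"\x81", "o\u0358\u0301".encode("utf-8")),
]
# The reverse table is the same pairs swapped: many code points -> one byte.
UTF8_TO_LEGACY = [(dst, src) for src, dst in LEGACY_TO_UTF8]

decode_trie = build_trie(LEGACY_TO_UTF8)
encode_trie = build_trie(UTF8_TO_LEGACY)
```

With these tables, convert(b"h\x80", decode_trie) yields the UTF-8 for "h", "o", U+0358, and feeding that back through encode_trie gives b"h\x80" again, while a bare "o" still encodes as the single ASCII byte. A converter that is not allowed to look ahead would have to emit 0x6F as soon as it saw "o", which is why the reverse table building is the part that may need work.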