Re: UCM file and combining character sequences

Nick Ing-Simmons Mon, 22 Sep 2003 01:08:35 -0700

Hank Tt <[EMAIL PROTECTED]> writes:
>Hi,
>
>I'm trying to make a UCM file to feed to enc2xs.  The legacy encoding for
>Taiwanese romanization *must* have its code points mapped to Unicode
>character sequences, for the simple reason that the UCS lacks the
>corresponding precomposed characters (and is unlikely to have them in the
>future, as they are composable using existing characters from the Latin
>script and the Combining Diacritical Marks blocks).  (See [1] for script
>details.)
>
>Now, IBM's ICU pages document the mapping of one Unicode code point to one
>legacy code point, as well as one-to-many, but apparently not many-to-one or
>many-to-many:
>
>" In the CHARMAP section of a .ucm file, each line contains a Unicode code
>point (like <U{1-6 hexadecimal digits for the code point}> ), a codepage
>character byte sequence (each byte like \x{2 hexadecimal digits} ).... " [2]
>
>How does enc2xs deal with (or intend to deal with) such a case?


It may not in its current form.

The underlying C code engine is an octet-sequence->octet-sequence
converter. So provided the source encoding is unambiguous (without
lookahead), it can be represented. Whether the UCM format can express it
is less clear, but I don't see why not: it too has two chunks of octets
per line. What may need some work is the table building, so that the
reverse mapping (base-char + mark) returns one encoded thing.
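
To make the octet-sequence view concrete, here is a minimal sketch in
Python. It is not the enc2xs or Encode implementation. The legacy byte
value \x80 and the names LEGACY_TO_UNICODE, DECODE, ENCODE and convert are
all hypothetical, invented for illustration. The sketch also uses greedy
longest-match on the reverse direction, which is exactly the kind of
lookahead the table building would have to account for.

```python
# Hypothetical legacy table: one legacy byte maps to a Unicode *sequence*.
# The byte value 0x80 is invented for this example.
LEGACY_TO_UNICODE = {
    b"a": "a",               # plain ASCII letter
    b"\x80": "a\u030d",      # a + U+030D COMBINING VERTICAL LINE ABOVE
}

# Both directions are just octet-sequence -> octet-sequence tables.
DECODE = {k: v.encode("utf-8") for k, v in LEGACY_TO_UNICODE.items()}
ENCODE = {v: k for k, v in DECODE.items()}


def convert(data: bytes, table: dict) -> bytes:
    """Translate data using greedy longest-match lookup in table."""
    max_len = max(len(key) for key in table)
    out = bytearray()
    i = 0
    while i < len(data):
        # Try the longest candidate first so that base-char + mark wins
        # over the bare base character.
        for n in range(min(max_len, len(data) - i), 0, -1):
            chunk = data[i:i + n]
            if chunk in table:
                out += table[chunk]
                i += n
                break
        else:
            raise ValueError(f"unmappable octet at offset {i}")
    return bytes(out)


legacy = b"a\x80"
utf8 = convert(legacy, DECODE)      # legacy octets -> UTF-8 octets
roundtrip = convert(utf8, ENCODE)   # UTF-8 octets -> legacy octets
```

In the reverse direction the UTF-8 octets for "a" are a prefix of the
octets for "a" + mark, so a converter that commits on the first match
would emit the wrong legacy byte. That is the ambiguity the table
building would need to resolve.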

>Is the ICU
>specification to be followed rigidly?

No, we follow it pragmatically, but we may not yet be handling all that
ICU can express.

>
>Since I am very new to Perl, any insight is appreciated.
>
>[1] http://lomaji.com/poj/chart.html
>[2] http://oss.software.ibm.com/icu/userguide/conversion-data.html
>
>--Henry H. Tan-Tenn
