Nick Ing-Simmons <[EMAIL PROTECTED]> wrote:
> SADAHIRO Tomoyuki <[EMAIL PROTECTED]> writes:
> >Hello.
> >
> >For round-trip fidelity, Mac OS CJK encodings include many characters
> >with mapping a single character in a Mac OS encoding
> >to a sequence of standard Unicode characters.
> >(cf. ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/README.TXT )
> >
> >In the case of Encode.pm, such characters are marked with |3
> >("reverse fallback", only from the encoding to Unicode, but not back),
> >so roundtrip conversion is not achieved.
>
> I think I copied those markings from ICU. I am not 100% sure that fallbacks
> are "compiled" correctly, and I am not an expert on CJK stuff.
> If it would be more "perlish" to make the round-trip conversion work
> by default Encode.pm can be less pedantic than ICU and allow it.

Never mind. Actually I don't work with a Macintosh, so I'm not sure what
people who use the Mac (the machine itself and/or its encodings) would want.

I would like to give an example: when MacKorean is handled via Unicode,
the following behavior is not inconsistent, though a user might find
it strange.

    #!perl
    use encoding 'MacKorean';                  # (1)
    our $string = "\xAA\x45\xB4\xEB\xAA\x8E";  # (2)
    $string =~ s/\xB4\xEB//g;                  # (3)
    print $string;                             # (4)
    __END__

<RESULT>
"\x{20de}" does not map to MacKorean.
"\x{20dd}" does not map to MacKorean.
\x{20de}\x{20dd}

cf. a part of macKorean.ucm

<UB300> \xB4\xEB |0 # hangul DAE
<UB300><U20DD> \xAA\x8E |3 # hangul DAE + COMBINING ENCLOSING CIRCLE
<UB300><U20DE> \xAA\x45 |3 # hangul DAE + COMBINING ENCLOSING SQUARE

At line (2), $string consists of three Korean characters:
"\xAA\x45", "\xB4\xEB", and "\xAA\x8E".
Once decoded, each of them begins with \x{B300}, so the substitution
at (3) removes \x{B300} from all three and leaves only the two
combining marks, which cannot be mapped back to MacKorean on their own.
Someone who expects (3) to remove only "\xB4\xEB" would expect
the result at (4) to be "\xAA\x45\xAA\x8E".

This "problem" would not be resolved by round-trip conversion.

Possible solutions are as follows
(I don't think any of them would be very practical, though):

(a) mapping *all* the characters in an encoding
    to a single Unicode character each (e.g. in the private use areas).

(b) grapheme-aware operations that distinguish
    \x{B300}\x{20DD} from \x{B300} as a grapheme.

    * But \X is insufficient; it must also cope with
      the hint characters in the PUA
      (http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/CORPCHAR.TXT)
      including:

      0xF860 # transcoding hint: group next 2 characters # Japanese,Korean
      0xF861 # transcoding hint: group next 3 characters # Japanese,Korean
      0xF862 # transcoding hint: group next 4 characters # Japanese,Korean

      Then /\x{F860}\p{Any}{2}/, /\x{F861}\p{Any}{3}/,
      /\x{F862}\p{Any}{4}/, etc. would each be a single grapheme
      for Macintosh encodings.

(c) multibyte-aware operations without conversion to Unicode
    (something like Jperl in the old days).

(d) giving up on Macintosh encodings...

regards,
SADAHIRO Tomoyuki
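
P.S. As a rough illustration of (b), here is a minimal sketch in plain Perl.
It does not use Encode.pm at all: the decoded string is written out by hand
from the macKorean.ucm lines quoted above, and mac_units() is a hypothetical
helper invented for this example, not an existing API. It only shows the idea
of splitting a decoded string into units that each stand for one character
of the Mac OS encoding, and then operating on those units instead of on
individual code points.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption of this sketch, not part of Encode.pm):
# split a decoded string into units, where a unit is either a transcoding
# hint plus the characters it groups, or an ordinary grapheme cluster.
sub mac_units {
    my ($str) = @_;
    return $str =~ m{\G(
          \x{F860}\p{Any}{2}    # hint: group next 2 characters
        | \x{F861}\p{Any}{3}    # hint: group next 3 characters
        | \x{F862}\p{Any}{4}    # hint: group next 4 characters
        | \X                    # otherwise one extended grapheme cluster
    )}gsx;
}

# Format a string as a list of code points, so nothing wide is printed.
sub show {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

# Hand-written Unicode form of "\xAA\x45\xB4\xEB\xAA\x8E",
# taken from the three ucm lines quoted above.
my $decoded = "\x{B300}\x{20DE}\x{B300}\x{B300}\x{20DD}";

# Code-point-level substitution: strips every U+B300.
(my $plain = $decoded) =~ s/\x{B300}//g;

# Unit-level removal: drops only the unit that is exactly U+B300.
my $aware = join '', grep { $_ ne "\x{B300}" } mac_units($decoded);

print show($plain), "\n";    # U+20DE U+20DD
print show($aware), "\n";    # U+B300 U+20DE U+B300 U+20DD
```

The second line of output corresponds to "\xAA\x45\xAA\x8E" in MacKorean
according to the ucm lines above. A real implementation would have to apply
the same grouping inside the regex engine and the string functions, which is
why I doubt it would be very practical.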