> SADAHIRO Tomoyuki <[EMAIL PROTECTED]> said:
>
> > P.S. Another problem. How can it be determined whether that
> > user-defined character (UDC hereafter) is single-byte or double-byte?
> >
> > The file big5-eten.ucm does not contain how to determine the character
> > length in bytes for an unmapped UDC.
>
> As I understand it, the "parsing" rules for big5 involve stepping
> through the character stream one byte at a time, and:
>
> - if the byte just taken is 7-bit ASCII (hi-bit clear), you have one
>   complete character (*); otherwise:
>
> - when the byte just taken is in the range [\xA1-\xFE], you have the
>   first half of a 16-bit big5 character, and you need to get the next
>   byte as well; if that next byte is in the range [\x40-\x7E\xA1-\xFE],
>   then you now have a complete big5 code point
>
> - an initial byte in the range [\x80-\xA0\xFF] is presumably some form
>   of noise, and should be discarded; likewise, when expecting the second
>   byte of a big5 character, a byte in the range [\x00-\x3F\x7F-\xA0\xFF]
>   is also noise, and presumably both this byte and the one preceding it
>   should be discarded. (**)
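
The byte-stepping rules quoted above can be sketched roughly as follows. This is only an illustration, not code from either message: the sub name big5_tokens and the labels 'ascii', 'big5' and 'noise' are made up here, and the input is assumed to be a byte string (not a decoded character string).

```perl
use strict;
use warnings;

# Split a byte string into [kind, bytes] pairs following the quoted rules.
# big5_tokens and the kind labels are hypothetical names for this sketch.
sub big5_tokens {
    my ($bytes) = @_;
    my @tokens;
    my $i   = 0;
    my $len = length $bytes;
    while ($i < $len) {
        my $b1 = substr($bytes, $i, 1);
        if ($b1 =~ /^[\x00-\x7F]\z/) {
            # hi-bit clear: one complete single-byte character
            push @tokens, ['ascii', $b1];
            $i += 1;
        }
        elsif ($b1 =~ /^[\xA1-\xFE]\z/) {
            # lead byte: the next byte decides whether the pair is valid
            my $b2 = $i + 1 < $len ? substr($bytes, $i + 1, 1) : '';
            if ($b2 =~ /^[\x40-\x7E\xA1-\xFE]\z/) {
                push @tokens, ['big5', $b1 . $b2];
            }
            else {
                # bad (or missing) trail byte: discard both, as quoted
                push @tokens, ['noise', $b1 . $b2];
            }
            $i += 2;
        }
        else {
            # \x80-\xA0 or \xFF where a lead byte was expected
            push @tokens, ['noise', $b1];
            $i += 1;
        }
    }
    return @tokens;
}

for my $t (big5_tokens("A\xA4\x40\x80\xA4\x20")) {
    my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $t->[1];
    printf "%-5s %s\n", $t->[0], $hex;
}
# prints:
# ascii 41
# big5  A4 40
# noise 80
# noise A4 20
```

Note that under these rules a lead byte followed by an ASCII byte swallows that ASCII byte as well, which is what the quoted description asks for but may not be what every application wants.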

Right, but such noise may be due to confusion with CP-950 or BIG-5 HKSCS
(or others?). They map some characters in the lead-byte range \x81-\xA0.
We can use decode 'cp950' or decode 'big5-hkscs', though.

Well, the problem is possibly that "big-5" has many, many variants.
(cf. http://i18n.linux.org.tw/openi18n/big5/index_en.html )

> footnotes:
(snip)
> There is still the issue that those rules map out a very large range of
> potential code points, many of which are not in fact used or defined in
> Chinese. Also, there must be some number of big5 code points that are
> used/defined (at least by some big5 applications), but are not mapped to
> Unicode. How Perl "decode()" handles these cases may be a problem where
> developers still have some work to do to fix things...
>
> Dave Graff

For example, Microsoft defines a mapping of extended UDC (EUDC) to the
Private Use Area (PUA) in Unicode. This mapping can be computed
algorithmically as follows. Each lead byte covers 0x9D (157) trail bytes:
63 in \x40-\x7E and 94 in \xA1-\xFE.

    # EUDC double-byte code (byte string) -> PUA code point, E000..F848.
    # Returns nothing if the bytes are outside the EUDC ranges.
    sub eudc2pua {
        my $cp = shift;
        if ($cp =~ /^([\x81-\x8D])([\x40-\x7E\xA1-\xFE])/) { # EEB8..F6B0
            my $le = ord($1);
            my $tr = ord($2);
            return 0xeeb8 + ($le - 0x81) * 0x9D
                 + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
        }
        if ($cp =~ /^([\x8E-\xA0])([\x40-\x7E\xA1-\xFE])/) { # E311..EEB7
            my $le = ord($1);
            my $tr = ord($2);
            return 0xe311 + ($le - 0x8e) * 0x9D
                 + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
        }
        if ($cp =~ /^\xC6([\xA1-\xFE])/) {                   # F6B1..F70E
            my $tr = ord($1);
            return 0xf6b1 + $tr - 0xA1;
        }
        if ($cp =~ /^([\xC7\xC8])([\x40-\x7E\xA1-\xFE])/) {  # F70F..F848
            my $le = ord($1);
            my $tr = ord($2);
            return 0xf70f + ($le - 0xc7) * 0x9D
                 + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
        }
        if ($cp =~ /^([\xFA-\xFE])([\x40-\x7E\xA1-\xFE])/) { # E000..E310
            my $le = ord($1);
            my $tr = ord($2);
            return 0xe000 + ($le - 0xfa) * 0x9d
                 + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
        }
        return;
    }

    # PUA code point -> EUDC double-byte code (byte string).
    # Returns nothing if the code point is outside E000..F848.
    sub pua2eudc {
        my $uv = shift;
        if (0xe000 <= $uv && $uv <= 0xe310) {
            $uv -= 0xe000;
            my $tr = $uv % 0x9D + 0x40;
            return pack 'CC', int($uv/0x9D) + 0xFA,
                              $tr + ($tr > 0x7E ? 0x22 : 0);
        }
        if (0xe311 <= $uv && $uv <= 0xeeb7) {
            $uv -= 0xe311;
            my $tr = $uv % 0x9D + 0x40;
            return pack 'CC', int($uv/0x9D) + 0x8E,
                              $tr + ($tr > 0x7E ? 0x22 : 0);
        }
        if (0xeeb8 <= $uv && $uv <= 0xf6b0) {
            $uv -= 0xeeb8;
            my $tr = $uv % 0x9D + 0x40;
            return pack 'CC', int($uv/0x9D) + 0x81,
                              $tr + ($tr > 0x7E ? 0x22 : 0);
        }
        if (0xf6b1 <= $uv && $uv <= 0xf70e) {
            $uv -= 0xf6b1;
            return pack 'CC', 0xC6, $uv + 0xA1;
        }
        if (0xf70f <= $uv && $uv <= 0xf848) {
            $uv -= 0xf70f;
            my $tr = $uv % 0x9D + 0x40;
            return pack 'CC', int($uv/0x9D) + 0xC7,
                              $tr + ($tr > 0x7E ? 0x22 : 0);
        }
        return;
    }

P.S. This EUDC mapping *was* available from Microsoft typography
( http://www.microsoft.com/typography/default.asp ), but that file has
been deleted. I don't know the reason; my guess is that the mapping was
an older version than the one now distributed under
www.unicode.org/Public/MAPPINGS.

However, the fact that the lead-byte range for CP-950 is \x81-\xfe
is shown in
http://www.microsoft.com/globaldev/reference/dbcs/950.htm
(additional lead bytes are identified by a darker gray background)
and in
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT

SADAHIRO Tomoyuki
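
The arithmetic in eudc2pua/pua2eudc can also be cross-checked with a table-driven variant. The following is a sketch under the assumption that the five ranges given in the subs above are right; the names @FULL_ROWS, trail_index, index_trail, eudc_to_pua and pua_to_eudc are hypothetical and not part of Encode or any other module. It walks every code point in E000..F848 and confirms that converting to bytes and back returns the same value.

```perl
use strict;
use warnings;

# [first lead byte, last lead byte, first PUA code point] for the blocks
# that use all 157 trail bytes; values copied from the subs above.
my @FULL_ROWS = (
    [0xFA, 0xFE, 0xE000],
    [0x8E, 0xA0, 0xE311],
    [0x81, 0x8D, 0xEEB8],
    [0xC7, 0xC8, 0xF70F],
);
my $ROW = 0x9D;    # 63 trail bytes in 40..7E plus 94 in A1..FE

# Trail byte <-> position 0..156 within a row.
sub trail_index { my $tr = shift; return $tr >= 0xA1 ? $tr - 0x62 : $tr - 0x40 }
sub index_trail { my $i  = shift; return $i < 0x3F ? $i + 0x40 : $i + 0x62 }

sub eudc_to_pua {
    my ($le, $tr) = unpack 'CC', shift;
    return unless defined $tr;
    return unless ($tr >= 0x40 && $tr <= 0x7E)
               || ($tr >= 0xA1 && $tr <= 0xFE);
    # Lead byte C6 contributes only its upper half row (A1..FE).
    return 0xF6B1 + $tr - 0xA1 if $le == 0xC6 && $tr >= 0xA1;
    for my $r (@FULL_ROWS) {
        my ($lo, $hi, $base) = @$r;
        next if $le < $lo || $le > $hi;
        return $base + ($le - $lo) * $ROW + trail_index($tr);
    }
    return;
}

sub pua_to_eudc {
    my $uv = shift;
    return pack 'CC', 0xC6, $uv - 0xF6B1 + 0xA1
        if $uv >= 0xF6B1 && $uv <= 0xF70E;
    for my $r (@FULL_ROWS) {
        my ($lo, $hi, $base) = @$r;
        my $n = $uv - $base;
        next if $n < 0 || $n >= ($hi - $lo + 1) * $ROW;
        return pack 'CC', $lo + int($n / $ROW), index_trail($n % $ROW);
    }
    return;
}

my $count = 0;
for my $uv (0xE000 .. 0xF848) {
    my $bytes = pua_to_eudc($uv);
    die sprintf("round trip failed at U+%04X", $uv)
        unless defined $bytes && eudc_to_pua($bytes) == $uv;
    $count++;
}
printf "%d code points round-trip\n", $count;   # prints 6217 ...
```

The count 6217 is 5*157 + 19*157 + 13*157 + 94 + 2*157, i.e. the five blocks are contiguous in the PUA with no gaps between them.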
