Re: Warning messages for ill-formed data

SADAHIRO Tomoyuki Fri, 21 Mar 2003 17:14:09 -0800

> SADAHIRO Tomoyuki <[EMAIL PROTECTED]> said:
> 
> > P.S. Another problem. How can it be determined whether that
> > user-defined character (UDC hereafter) is single-byte or double-byte? 
> >
> > The file big5-eten.ucm does not say how to determine the character
> > length in bytes for an unmapped UDC.
> 
> As I understand it, the "parsing" rules for big5 involve stepping 
> through the character stream one byte at a time, and:
> 
>  - if the byte just taken is 7-bit ASCII (hi-bit clear), you have one 
>  complete character (*); otherwise:
> 
>  - when the byte just taken is in the range [\xA1-\xFE], you have the 
>  first half of a 16-bit big5 character, and you need to get the next 
>  byte as well; if that next byte is in the range [\x40-\x7E\xA1-\xFE], 
>  then you now have a complete big5 code point
> 
>  - an initial byte in the range [\x80-\xA0\xFF] is presumably some form
>  of noise, and should be discarded; likewise, when expecting the second
>  byte of a big5 character, a byte in the range [\x00-\x3F\x7F-\xA0\xFF]
>  is also noise, and presumably both this byte and the one preceding it 
>  should be discarded. (**)
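
The rules quoted above can be sketched in Perl roughly as follows.
This is only a minimal sketch: the name split_big5 and the token
labels 'ascii', 'big5' and 'noise' are made up for illustration and
are not part of Encode. Noise is labelled rather than silently
dropped, so that the caller can decide what to do with it.

```perl
use strict;
use warnings;

# Split a byte string into tokens following the quoted rules.
# Returns a list of [kind, bytes] pairs (hypothetical helper).
sub split_big5 {
    my $bytes = shift;
    my @tokens;
    pos($bytes) = 0;
    while (pos($bytes) < length $bytes) {
        if ($bytes =~ /\G([\x00-\x7F])/gc) {
            push @tokens, ['ascii', $1];    # hi-bit clear: one complete character
        }
        elsif ($bytes =~ /\G([\xA1-\xFE][\x40-\x7E\xA1-\xFE])/gc) {
            push @tokens, ['big5', $1];     # valid lead byte + valid trail byte
        }
        elsif ($bytes =~ /\G([\xA1-\xFE].)/gcs) {
            push @tokens, ['noise', $1];    # valid lead, bad trail: both are noise
        }
        else {
            $bytes =~ /\G(.)/gcs;
            push @tokens, ['noise', $1];    # bad lead byte, or lead byte at end of input
        }
    }
    return @tokens;
}

for my $tok (split_big5("A\xA4\x40\x80\xA4\x20B")) {
    my ($kind, $raw) = @$tok;
    printf "%-5s %s\n", $kind, uc unpack('H*', $raw);
}
```

Note that this treats every lead byte in \x80-\xA0 as noise, which is
exactly the assumption that breaks down for the variants below.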


Right, but such noise may come from confusion
with CP-950 or BIG-5 HKSCS (or others?).
Both map some characters with a leading byte in \x81-\xA0.
In that case we can use decode 'cp950' or decode 'big5-hkscs' instead, though.
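
A minimal sketch of trying the same bytes under each variant, assuming
the encoding names big5-eten, cp950 and big5-hkscs as shipped with
Encode (Encode::TW). The helper name decode_as is made up for
illustration. Encode::FB_CROAK makes decode die on bytes the chosen
variant cannot map, instead of substituting a replacement character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode $bytes under the named encoding, dying on unmappable input.
# (hypothetical helper; $bytes is a copy, so the caller's string is untouched)
sub decode_as {
    my ($enc, $bytes) = @_;
    return decode($enc, $bytes, Encode::FB_CROAK);
}

my $bytes = "\xA4\x40";    # a code in the range shared by all three variants
for my $enc (qw(big5-eten cp950 big5-hkscs)) {
    my $str = eval { decode_as($enc, $bytes) };
    if (defined $str) {
        printf "%-10s U+%04X\n", $enc, ord $str;
    }
    else {
        printf "%-10s cannot decode: %s", $enc, $@;
    }
}
```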

Well, the problem is possibly that "big-5" has many, many variants.
  (cf. http://i18n.linux.org.tw/openi18n/big5/index_en.html )

> footnotes:
(snip)

> There is still the issue that those rules map out a very large range of
> potential code points, many of which are not in fact used or defined in
> Chinese.  Also, there must be some number of big5 code points that are
> used/defined (at least by some big5 applications), but are not mapped to
> Unicode.  How Perl "decode()" handles these cases may be a problem where
> developers still have some work to do to fix things...
>
>       Dave Graff

For example, Microsoft defines a mapping
of extended UDC (EUDC) to the Private Use Area (PUA) in Unicode.
This mapping can be computed algorithmically, as follows.

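# Map a two-byte CP-950 EUDC code (a byte string) to its PUA code point.
# Each lead byte covers 0x9D (157) trail bytes: \x40-\x7E and \xA1-\xFE.
# Returns nothing if the bytes are outside the EUDC ranges.
sub eudc2pua { # E000..F848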
    my $cp = shift;

    if ($cp =~ /^([\x81-\x8D])([\x40-\x7E\xA1-\xFE])/) { # EEB8..F6B0
        my $le = ord($1);
        my $tr = ord($2);
        return 0xeeb8 +
            ($le - 0x81) * 0x9D + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
    }
    if ($cp =~ /^([\x8E-\xA0])([\x40-\x7E\xA1-\xFE])/) { # E311..EEB7
        my $le = ord($1);
        my $tr = ord($2);
        return 0xe311 +
            ($le - 0x8e) * 0x9D + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
    }
    if ($cp =~ /^\xC6([\xA1-\xFE])/) { # F6B1..F70E
        my $tr = ord($1);
        return 0xf6b1 + $tr - 0xA1;
    }
    if ($cp =~ /^([\xC7\xC8])([\x40-\x7E\xA1-\xFE])/) { # F70F..F848
        my $le = ord($1);
        my $tr = ord($2);
        return 0xf70f +
            ($le - 0xc7) * 0x9D + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
    }
    if ($cp =~ /^([\xFA-\xFE])([\x40-\x7E\xA1-\xFE])/) { # E000..E310
        my $le = ord($1);
        my $tr = ord($2);
        return 0xe000 +
            ($le - 0xfa) * 0x9d + $tr - ($tr >= 0xA1 ? 0x62 : 0x40);
    }
    return;
}


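# Inverse of eudc2pua: map a PUA code point to its two-byte EUDC code.
# Returns nothing if the code point is outside E000..F848.
sub pua2eudc {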
    my $uv = shift;
    if (0xe000 <= $uv && $uv <= 0xe310) {
        $uv -= 0xe000;
        my $tr = $uv % 0x9D + 0x40;
        return pack 'CC', int($uv/0x9D) + 0xFA,
             $tr + ($tr > 0x7E ? 0x22 : 0);
    }
    if (0xe311 <= $uv && $uv <= 0xeeb7) {
        $uv -= 0xe311;
        my $tr = $uv % 0x9D + 0x40;
        return pack 'CC', int($uv/0x9D) + 0x8E,
            $tr + ($tr > 0x7E ? 0x22 : 0);
    }
    if (0xeeb8 <= $uv && $uv <= 0xf6b0) {
        $uv -= 0xeeb8;
        my $tr = $uv % 0x9D + 0x40;
        return pack 'CC', int($uv/0x9D) + 0x81,
            $tr + ($tr > 0x7E ? 0x22 : 0);
    }
    if (0xf6b1 <= $uv && $uv <= 0xf70e) {
        $uv -= 0xf6b1;
        return pack 'CC', 0xC6, $uv + 0xA1;
    }
    if (0xf70f <= $uv && $uv <= 0xf848) {
        $uv -= 0xf70f;
        my $tr = $uv % 0x9D + 0x40;
        return pack 'CC', int($uv/0x9D) + 0xC7,
            $tr + ($tr > 0x7E ? 0x22 : 0);
    }
    return;
}
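
As a cross-check, the same five ranges can be written as a table and
tested for a round trip. This is only a sketch derived from the
numbers above: the names @ROWS, eudc_to_pua and pua_to_eudc are made
up for illustration, and the table has not been checked against any
Microsoft file.

```perl
use strict;
use warnings;

# [first lead byte, last lead byte, PUA code point of the first cell]
my @ROWS = (
    [0xFA, 0xFE, 0xE000],
    [0x8E, 0xA0, 0xE311],
    [0x81, 0x8D, 0xEEB8],
    [0xC7, 0xC8, 0xF70F],
);

sub eudc_to_pua {
    my ($le, $tr) = unpack 'CC', shift;
    return unless defined $tr;                  # fewer than two bytes
    my $low  = $tr >= 0x40 && $tr <= 0x7E;
    my $high = $tr >= 0xA1 && $tr <= 0xFE;
    # Lead byte C6 is special: only its upper half (A1..FE) is EUDC.
    return 0xF6B1 + $tr - 0xA1 if $le == 0xC6 && $high;
    return unless $low || $high;
    for my $row (@ROWS) {
        my ($first, $last, $base) = @$row;
        next if $le < $first || $le > $last;
        return $base + ($le - $first) * 0x9D + $tr - ($high ? 0x62 : 0x40);
    }
    return;
}

sub pua_to_eudc {
    my $uv = shift;
    return pack 'CC', 0xC6, $uv - 0xF6B1 + 0xA1
        if $uv >= 0xF6B1 && $uv <= 0xF70E;
    for my $row (@ROWS) {
        my ($first, $last, $base) = @$row;
        my $size = ($last - $first + 1) * 0x9D;
        next if $uv < $base || $uv >= $base + $size;
        my $off = $uv - $base;
        my $tr  = $off % 0x9D + 0x40;
        $tr += 0x22 if $tr > 0x7E;              # skip the gap \x7F-\xA0
        return pack 'CC', $first + int($off / 0x9D), $tr;
    }
    return;
}

# Every code point in E000..F848 should survive a round trip.
for my $uv (0xE000 .. 0xF848) {
    my $back = eudc_to_pua(pua_to_eudc($uv));
    die sprintf("round trip failed at U+%04X\n", $uv)
        unless defined $back && $back == $uv;
}
printf "%s -> U+%04X\n", uc unpack('H*', $_), eudc_to_pua($_)
    for "\xFA\x40", "\x8E\x40", "\x81\x40";
```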

P.S. This EUDC mapping *was* available from Microsoft Typography
 ( http://www.microsoft.com/typography/default.asp ),
but that file has been deleted.  I don't know the reason, but
my guess is that the mapping was an older version
than the one now distributed under www.unicode.org/Public/MAPPINGS.
However, the fact that the leading byte range
for CP-950 is \x81-\xFE is shown in
  http://www.microsoft.com/globaldev/reference/dbcs/950.htm
 (additional leadbytes are identified by a darker gray background)
and in
  http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP950.TXT

SADAHIRO Tomoyuki
