RE: Warning messages for ill-formed data

2003-03-24 Thread Mark Lewellen
I often encounter lower-ASCII codes mixed in with Big5 text, which is
fine and straightforward to handle.  However, a problem arises when
upper-ASCII bytes occasionally occur outside of the Big5 range.  When
such a byte occurs, it is probably an error or part of a user-defined
character.
However, it appears that Encode DOES NOT display warnings for these but
rather maps individual upper ascii to conventional characters such as
Roman letters with diacritics commonly found in European languages.
(It appears that Encode displays warnings for characters that are within
the Big5 range, but do not have a mapping to Unicode, perhaps because
these code points are not used in Big5 itself.)  

Is there a way to cause Encode to display warnings for upper ascii
outside of the Big5 range when converting from Big5 to Unicode?  If
not, could the developers consider this for a future fix?

Mark

 
> > P.S. Another problem. How can it be determined whether that
> > user-defined character (UDC hereafter) is single-byte or
> > double-byte?
> >
> > The file big5-eten.ucm does not say how to determine the character
> > length in bytes for an unmapped UDC.


> As I understand it, the "parsing" rules for big5 involve stepping
> through the character stream one byte at a time, and:
>
>  - if the byte just taken is 7-bit ASCII (hi-bit clear), you have one
>  complete character (*); otherwise:
>
>  - when the byte just taken is in the range [\xA1-\xFE], you have the
>  first half of a 16-bit big5 character, and you need to get the next
>  byte as well; if that next byte is in the range
>  [\x40-\x7E\xA1-\xFE], then you now have a complete big5 code point
>
>  - an initial byte in the range [\x80-\xA0\xFF] is presumably some
>  form of noise, and should be discarded; likewise, when expecting the
>  second byte of a big5 character, a byte in the range
>  [\x00-\x3F\x7F-\xA0\xFF] is also noise, and presumably both this
>  byte and the one preceding it should be discarded. (**)
> 
> footnotes:
> 
> (*) If reading a plain text file, you would of course expect (hope)
> that the ASCII codes are limited to just white-space and [\x21-\x7E]
> (and maybe \x07 "bell") -- i.e. no nulls, deletes, backspaces, EOT,
> etc; still, if these occur, they should behave as ASCII for purposes
> of parsing the characters.
>
> (**) I'm really just guessing about what sort of action should be
> taken when a stream violates the rules; discarding one or two bytes
> at a time when they happen to be out of bounds should be the "safest"
> approach.
>
> There is still the issue that those rules map out a very large range
> of potential code points, many of which are not in fact used or
> defined in Chinese.  Also, there must be some number of big5 code
> points that are used/defined (at least by some big5 applications),
> but are not mapped to Unicode.  How Perl "decode()" handles these
> cases may be a problem where developers still have some work to do to
> fix things...
> 
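
The byte-stepping rules quoted above can be sketched in Perl roughly as
follows.  This is only an illustration of those rules, not how Encode
itself parses Big5.  The name scan_big5 and its [kind, bytes, offset]
token layout are hypothetical, invented for this example, and dropping
both bytes after a bad second byte simply follows the guess in footnote
(**).

```perl
use strict;
use warnings;

# Hypothetical helper: classify the bytes of $bytes using the rules quoted
# above.  Returns a list of [kind, raw_bytes, offset] array refs, where
# kind is 'ascii', 'big5', or 'noise'.
sub scan_big5 {
    my ($bytes) = @_;
    my @tokens;
    my $len = length $bytes;
    my $pos = 0;

    while ($pos < $len) {
        my $lead = ord(substr($bytes, $pos, 1));

        if ($lead <= 0x7F) {
            # High bit clear: one complete single-byte character.
            push @tokens, ['ascii', substr($bytes, $pos, 1), $pos];
            $pos += 1;
        }
        elsif ($lead >= 0xA1 && $lead <= 0xFE) {
            if ($pos + 1 >= $len) {
                # Lead byte at end of input with nothing after it.
                push @tokens, ['noise', substr($bytes, $pos, 1), $pos];
                $pos += 1;
                next;
            }
            my $trail = ord(substr($bytes, $pos + 1, 1));
            my $ok = ($trail >= 0x40 && $trail <= 0x7E)
                  || ($trail >= 0xA1 && $trail <= 0xFE);
            # A bad second byte marks the whole pair as noise, per (**).
            push @tokens, [$ok ? 'big5' : 'noise', substr($bytes, $pos, 2), $pos];
            $pos += 2;
        }
        else {
            # 0x80-0xA0 or 0xFF cannot start a character under these rules.
            push @tokens, ['noise', substr($bytes, $pos, 1), $pos];
            $pos += 1;
        }
    }
    return @tokens;
}

# Report every noise span so it can be inspected rather than silently lost.
my $sample = "A\xA4\x40\x80\xA4\x20Z";
for my $tok (scan_big5($sample)) {
    my ($kind, $raw, $offset) = @$tok;
    next unless $kind eq 'noise';
    warn sprintf("noise at byte %d: %s\n", $offset,
        join(' ', map { sprintf('%02X', ord($_)) } split(//, $raw)));
}
```

Running a pre-scan like this before decode() is one way to get a report
of out-of-range upper-ASCII bytes with their offsets, independent of
what the Big5 mapping table does with them.  It says nothing about
whether a well-formed two-byte code point is actually mapped to Unicode.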



Re: Warning messages for ill-formed data

2003-03-24 Thread Dan Kogai
On Tuesday, Mar 25, 2003, at 13:59 Asia/Tokyo, Mark Lewellen wrote:
> Is there a way to cause Encode to display warnings for upper ascii
> outside of the Big5 range when converting from Big5 to Unicode?  If
> not, could the developers consider this for a future fix?

Use the optional 3rd argument to decode().

$utf8 = decode("Big5" => $big5); # ill-formed chars are mapped to U+FFFD
$utf8 = decode("Big5" => $big5, Encode::FB_WARN); # same but warnings issued

See the "Handling Malformed Data" section of "perldoc Encode" for how
to use the 3rd argument.
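
As a minimal sketch of the 3rd argument in use, the helper below decodes
Big5 octets and collects whatever warnings Encode emits.  The name
decode_big5_collect is hypothetical.  The sketch assumes the behaviour
described in "Handling Malformed Data" in current Encode releases:
WARN_ON_ERR on its own warns and then substitutes, FB_WARN additionally
returns at the first malformed sequence, and LEAVE_SRC keeps decode()
from modifying its input.  Check this against the Encode version you
actually have installed.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode Big5 octets, collecting any warnings that
# Encode emits instead of printing them.  Returns ($text, \@warnings).
sub decode_big5_collect {
    my ($octets) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    # Assumption: WARN_ON_ERR without RETURN_ON_ERR warns, substitutes,
    # and keeps going; LEAVE_SRC leaves our copy of the input untouched.
    my $check = Encode::WARN_ON_ERR() | Encode::LEAVE_SRC();
    my $text  = decode('big5-eten', $octets, $check);

    return ($text, \@warnings);
}

# "\xA4\xA4" is the Big5 encoding of U+4E2D.
my ($text, $warnings) = decode_big5_collect("\xA4\xA4");
printf "decoded %d character(s), %d warning(s)\n",
    length($text), scalar(@$warnings);
```

Which bytes Encode treats as malformed depends on the big5-eten mapping
table, so this only surfaces the cases Encode itself reports.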

Dan the Encode Maintainer