Re: Fallback problems with Encode

Nick Ing-Simmons Sat, 28 Dec 2002 12:52:11 -0800

Earl Hood <[EMAIL PROTECTED]> writes:
>> >    ISO-8859-3 -> ISO-8859-8
>> >     !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
>> >    abcdefghijklmnopqrstuvwxyz{|}~
>> >     &#x126;&#x2d8;£¤\xA5&#x124;§¨
>
>Look more closely,  I will include some of the output again:
>
>  &#x126;&#x2d8;£¤\xA5&#x124;§¨
>  &#x130;&#x15e;&#x11e;&#x134;\xAE&#x17b;°&#x127;²³´µ&#x125;·¸
>  &#x131;&#x15f;&#x11f;&#x135;½\xBE&#x17c;&#xc0;&#xc1;&#xc2;\xC3
>  &#xc4;&#x10a;&#x108;&#xc7;&#xc8;&#xc9;&#xca;&#xcb;&#xcc;&#xcd;
>  &#xce;&#xcf;\xD0&#xd1;&#xd2;&#xd3;&#xd4;&#x120;&#xd6;ª&#x11c;
>  &#xd9;&#xda;&#xdb;&#xdc;&#x16c;&#x15c;&#xdf;&#xe0;&#xe1;&#xe2;
>  \xE3&#xe4;&#x10b;&#x109;&#xe7;&#xe8;&#xe9;&#xea;&#xeb;&#xec;
>  &#xed;&#xee;&#xef;\xF0&#xf1;&#xf2;&#xf3;&#xf4;&#x121;&#xf6;º
>  &#x11d;&#xf9;&#xfa;&#xfb;&#xfc;&#x16d;&#x15d;&#x2d9;
>
>Problem characters: \xA5 \xAE \xBE \xC3 \xD0 \xE3 \xF0
>
>
>> The FB_XMLCREF happens on the 'to' side. Your original code suffers
>> from fallbacks occurring on the 'from' side. 0x80..0xFF are not ASCII.
>> 
>> So when you use an 8-bit encoding like iso8859-3 you don't see the problem.
>
>See above where I highlight the problem characters.


So I was too glib. You see your "problem" when the octet is not defined
in the source character set. e.g. 0xA5 is not given a meaning by iso-8859-3.
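To make that concrete, here is a minimal sketch of the decode side on its own. It assumes a current Encode, where the \xHH substitution has to be requested explicitly with FB_PERLQQ (the defaults may differ from the 2002 release discussed here); the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode FB_PERLQQ);

# 0xA4 is defined in iso-8859-3 (CURRENCY SIGN); 0xA5 is not.
my $octets = "\xA4\xA5";

# FB_PERLQQ asks decode to write \xHH for each octet it cannot map.
my $chars = decode('iso-8859-3', $octets, FB_PERLQQ);

# Show every resulting character as a code point.
print join(' ', map { sprintf 'U+%04X', ord } split //, $chars), "\n";
```

The defined octet should come back as one character, while the undefined one should come back as the four ASCII characters \, x, A, 5.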

>Also, with
>the iso-2022-jp examples provided in my original post, illustrated
>the problem.

Possibly. iso-2022-jp is an escape encoding and has a whole slew of other 
things to worry about.

>
>BTW, in the t/fallbacks.t test case of Encode, 8-bit characters are
>used for the ascii test, and entity references are generated for the
>8-bit characters.
>
>As I stated in my original post, the problem is that t/fallbacks.t
>tests an undocumented (or poorly documented) Encode interface, and
>it does not test the well-documented interface.

Whether un(der)?documented or not, the object style used in t/fallbacks.t 
is the way the internals work. 

You say "... it is impractical to maintain unique
conversion tables between all types of character encodings." - it is even 
more impractical to _test_ them that way.
...
>
>For example, extending from my code sample in the original post,
>if you add the following:
>
>  my $meth = find_encoding('ascii');
>  my $src  = $org;
>  my $dst  = $meth->encode($src, FB_XMLCREF);
>  print $dst, "\n";
>
>The following is generated:
>
>   !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
>  abcdefghijklmnopqrstuvwxyz{|}~&#x80;&#x81;&#x82;&#x83;&#x84;&#x85;&#x86;
>  &#x87;&#x88;&#x89;&#x8a;&#x8b;&#x8c;&#x8d;&#x8e;&#x8f;&#x90;&#x91;&#x92;
>  &#x93;&#x94;&#x95;&#x96;&#x97;&#x98;&#x99;&#x9a;&#x9b;&#x9c;&#x9d;&#x9e;
>  &#x9f;&#xa0;&#xa1;&#xa2;&#xa3;&#xa4;&#xa5;&#xa6;&#xa7;&#xa8;&#xa9;&#xaa;
>  &#xab;&#xac;&#xad;&#xae;&#xaf;&#xb0;&#xb1;&#xb2;&#xb3;&#xb4;&#xb5;&#xb6;
>  &#xb7;&#xb8;&#xb9;&#xba;&#xbb;&#xbc;&#xbd;&#xbe;&#xbf;&#xc0;&#xc1;&#xc2;
>  &#xc3;&#xc4;&#xc5;&#xc6;&#xc7;&#xc8;&#xc9;&#xca;&#xcb;&#xcc;&#xcd;&#xce;
>  &#xcf;&#xd0;&#xd1;&#xd2;&#xd3;&#xd4;&#xd5;&#xd6;&#xd7;&#xd8;&#xd9;&#xda;
>  &#xdb;&#xdc;&#xdd;&#xde;&#xdf;&#xe0;&#xe1;&#xe2;&#xe3;&#xe4;&#xe5;&#xe6;
>  &#xe7;&#xe8;&#xe9;&#xea;&#xeb;&#xec;&#xed;&#xee;&#xef;&#xf0;&#xf1;&#xf2;
>  &#xf3;&#xf4;&#xf5;&#xf6;&#xf7;&#xf8;&#xf9;&#xfa;&#xfb;&#xfc;&#xfd;&#xfe;
>  &#xff;
>
>So why doesn't the from_to() usage generate the same results?

Because the ->decode side has removed the non-representable octets
and replaced each of them with four characters: \xHH. 
So there are no hi-bit chars left to cause entity refs.
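The two halves can be run by hand to see this. The sketch below is an illustration, not the from_to() source: it assumes a current Encode and passes FB_PERLQQ to the decode step explicitly to reproduce the \xHH behaviour described here.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_PERLQQ FB_XMLCREF);

my $octets = "A\xA5";    # 0xA5 has no meaning in ASCII

# Step 1: the 'from' side. The bad octet becomes the text \xA5.
my $chars = decode('ascii', $octets, FB_PERLQQ);

# Step 2: the 'to' side. Every character is now plain ASCII,
# so FB_XMLCREF has nothing left to turn into an entity ref.
my $out = encode('ascii', $chars, FB_XMLCREF);

print "$out\n";
```

By the time the encode step runs, the high-bit octet is already gone, so the fallback requested there never fires.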

>
>IMO, the ASCII case is then wrong.  If you want to be "strict" about
>the 7-bitness of ascii, then the "\xHH"s should not show up at all, but
>be '?'s, or something else.  

You can get that (I believe) by passing appropriate fallback options to 
->decode of ASCII. I personally dislike fallback to '?' as it loses 
information in a way that is hard to back-track, which is why the default 
fallback is \xHH.
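For the '?' style, one possible route in a current Encode is sketched below. It assumes FB_DEFAULT substitutes U+FFFD on decode and the encoding's substitution character ('?' for ascii) on encode; that is an assumption about present-day behaviour, not a statement about the 2002 release.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_DEFAULT);

my $octets = "A\xA5B";

# Decode: the undefined octet is replaced by U+FFFD.
my $chars = decode('ascii', $octets, FB_DEFAULT);

# Encode: U+FFFD is not ASCII, so it is replaced by '?'.
my $out = encode('ascii', $chars, FB_DEFAULT);

print "$out\n";
```

Note that the original octet value cannot be recovered from the result, which is the information loss mentioned above.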

>Since the output is "\xHH"s, it seems
>odd that FB_XMLCREF does not generate "&#HH;"s instead (see
>above).

XMLCREFs name Unicode code points. In ASCII, 0xA0 (e.g.) is NOT a Unicode 
character, it is undefined.

>
>Maybe I am misunderstanding Encode's conversion operations, so
>maybe it is a problem with the documentation not being clear about
>this behavior.  But IMHO, what I am getting appears to be incorrect.

And IMHO you are getting what I "designed" it to produce ;-) 

I strongly recommend doing conversions in two steps explicitly - that way 
you can get whatever you want.
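As one illustration of the two-step approach, the sketch below gets entity refs for every high-bit octet by decoding with an encoding that defines all 256 octets (iso-8859-1 is my choice for the example, not something prescribed here) and only then encoding to ASCII with FB_XMLCREF.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_XMLCREF);

my $octets = "A\xA5";

# Step 1: iso-8859-1 maps every octet, so nothing falls back here.
my $chars = decode('iso-8859-1', $octets);

# Step 2: U+00A5 is not ASCII, so the 'to' side emits an entity ref.
my $out = encode('ascii', $chars, FB_XMLCREF);

print "$out\n";
```

Choosing the source encoding and the fallback for each step separately is what makes the outcome controllable.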

I am also willing to concede that documentation could be improved :-)

>
>--ewh
-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
