Earl Hood <[EMAIL PROTECTED]> writes:
>> > ISO-8859-3 -> ISO-8859-8
>> > !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
>> >
>> > abcdefghijklmnopqrstuvwxyz{|}~
>> >
>> > Ħ˘£¤\xA5Ĥ§¨
>
>Look more closely, I will include some of the output again:
>
> Ħ˘£¤\xA5Ĥ§¨
> İŞĞĴ\xAEݰħ²³´µĥ·¸
> ışğĵ½\xBEżÀÁÂ\xC3
> ÄĊĈÇÈÉÊËÌÍ
> ÎÏ\xD0ÑÒÓÔĠÖªĜ
> ÙÚÛÜŬŜßàáâ
> \xE3äċĉçèéêëì
> íîï\xF0ñòóôġöº
> ĝùúûüŭŝ˙
>
>Problem characters: \xA5 \xAE \xBE \xC3 \xD0 \xE3 \xF0
>
>
>> The FB_XMLCREF happens on the 'to' side. Your original code suffers
>> from fallbacks occurring on the 'from' side. 0x80..0xFF are not ASCII.
>>
>> So when you use an 8-bit encoding like iso8859-3 you don't see the problem.
>
>See above where I highlight the problem characters.
So I was too glib. You see your "problem" when an octet is not defined
in the source character set, e.g. 0xA5 is not given a meaning by iso-8859-3.
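A minimal sketch of that case, decoding one defined and one undefined octet. FB_PERLQQ is passed explicitly as an assumption, so the result does not depend on whatever default fallback the installed Encode uses:

```perl
use strict;
use warnings;
use Encode qw(decode FB_PERLQQ);

# 0xA4 is defined in iso-8859-3 (CURRENCY SIGN); 0xA5 is not.
my $octets = "\xA4\xA5";

# With FB_PERLQQ the undefined octet is kept as the literal
# four-character sequence backslash, x, H, H.
my $chars = decode('iso-8859-3', $octets, FB_PERLQQ);

# Print code points rather than raw characters to keep the output ASCII.
my $shown = join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
print "$shown\n";
```

The first code point printed is the real character; the remaining four are plain ASCII standing in for the octet that had no meaning.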
>Also, with
>the iso-2022-jp examples provided in my original post, illustrated
>the problem.
Possibly. iso-2022-jp is an escape encoding and has a whole slew of other
things to worry about.
>
>BTW, in the t/fallback.t test case of Encode, 8-bit characters are
>used for the ascii test, and entity references are generated for the
>8-bit characters.
>
>As I stated in my original post, the problem is that t/fallback.t
>tests an undocumented (or poorly documented) Encode interface, and
>it does not test the well-documented interface.
Whether un(der)?documented or not, the object style used in t/fallback.t
is the way the internals work.
You say "... it is impractical to maintain unique
conversion tables between all types of character encodings." - it is even
more impractical to _test_ them that way.
...
>
>For example, extending from my code sample in the original post,
>if you add the following:
>
> my $meth = find_encoding('ascii');
> my $src = $org;
> my $dst = $meth->encode($src, FB_XMLCREF);
> print $dst, "\n";
>
>The following is generated:
>
> !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
> abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†
> ‡ˆ‰Š‹ŒŽ‘’
> “”•–—˜™š›œž
> Ÿ ¡¢£¤¥¦§¨©ª
> «¬­®¯°±²³´µ¶
> ·¸¹º»¼½¾¿ÀÁÂ
> ÃÄÅÆÇÈÉÊËÌÍÎ
> ÏÐÑÒÓÔÕÖרÙÚ
> ÛÜÝÞßàáâãäåæ
> çèéêëìíîïðñò
> óôõö÷øùúûüýþ
> ÿ
>
>So why doesn't the from_to() usage generate the same results?
Because the ->decode side has already removed the non-representable octets
and replaced each of them with four characters: \xHH.
So there are no high-bit characters left to cause entity refs.
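The difference between the two paths can be sketched as follows. This is an illustration only: FB_PERLQQ is given explicitly on the decode side (the default may differ between Encode releases), and the exact form of the character reference produced by FB_XMLCREF may also vary by release:

```perl
use strict;
use warnings;
use Encode qw(find_encoding FB_PERLQQ FB_XMLCREF);

my $ascii = find_encoding('ascii');

# Path 1: octets go through ->decode first, as from_to() does.
# The octet 0xA5 has no meaning in ASCII, so it becomes the four
# ASCII characters backslash, x, A, 5 before ->encode ever runs.
my $octets     = "a\xA5";
my $decoded    = $ascii->decode($octets, FB_PERLQQ);
my $via_decode = $ascii->encode($decoded, FB_XMLCREF);

# Path 2: a Perl string is handed straight to ->encode. Here "\xA5"
# is the character U+00A5, which ASCII cannot represent, so the
# FB_XMLCREF fallback fires.
my $string = "a\xA5";
my $direct = $ascii->encode($string, FB_XMLCREF);

print "via decode: $via_decode\n";
print "direct:     $direct\n";
```

Only the second path yields a character reference, because only there does ->encode see a character outside ASCII.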
>
>IMO, the ASCII case is then wrong. If you want to be "strict" about
>the 7-bitness of ascii, then the "\xHH"s should not show up at all, but
>be '?'s, or something else.
You can get that (I believe) by passing appropriate fallback options to
->decode of ASCII. I personally dislike fallback to '?' as it loses
information in a way that is hard to back-track - which is why the default
fallback is \xHH.
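A sketch of the lossy route, assuming an Encode release in which FB_DEFAULT substitutes U+FFFD on decode and the encoding's own substitution character on encode (this is what the current Encode documentation describes; other releases may behave differently):

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_DEFAULT);

my $octets = "a\xA5b";

# Step 1: the octet that ASCII does not define is replaced on decode.
# On releases matching the assumption above, the replacement is U+FFFD.
my $chars = decode('ascii', $octets, FB_DEFAULT);

# Step 2: U+FFFD is not an ASCII character either, so encode swaps in
# the substitution character of the target encoding.
my $lossy = encode('ascii', $chars, FB_DEFAULT);

print "$lossy\n";
```

Nothing in the result records which octet was dropped, which is the back-tracking problem described above.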
>Since the output is "\xHH"s, it seems
>odd that FB_XMLCREF does not generate "&#HH;"s instead (see
>above).
XMLCREFs are Unicode code points. ASCII 0xA0 (e.g.) is NOT Unicode; it is undefined.
>
>Maybe I am misunderstanding Encode's conversion operations, so
>maybe it is a problem with the documentation not being clear about
>this behavior. But IMHO, what I am getting appears to be incorrect.
And IMHO you are getting what I "designed" it to produce ;-)
I strongly recommend doing conversions in two steps explicitly - that way
you can get whatever you want.
I am also willing to concede that documentation could be improved :-)
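The two-step approach recommended above might look like the following sketch. convert() is a hypothetical helper, not part of Encode, and the fallback constants chosen here are just one possible combination:

```perl
use strict;
use warnings;
use Encode qw(find_encoding FB_PERLQQ FB_XMLCREF);

# convert() is a hypothetical helper, not an Encode function.
# It takes a separate fallback for each side of the conversion.
sub convert {
    my ($octets, $from, $to, $decode_check, $encode_check) = @_;
    my $src = find_encoding($from) or die "Unknown encoding: $from";
    my $dst = find_encoding($to)   or die "Unknown encoding: $to";

    my $chars = $src->decode($octets, $decode_check);  # octets -> characters
    return $dst->encode($chars, $encode_check);        # characters -> octets
}

# 0xA1 is defined in iso-8859-3 (U+0126); 0xA5 is not.
my $out = convert("\xA1\xA5", 'iso-8859-3', 'ascii', FB_PERLQQ, FB_XMLCREF);
print "$out\n";
```

The defined character is handled by the encode-side fallback and the undefined octet by the decode-side fallback, so each can be chosen independently.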
>
>--ewh
--
Nick Ing-Simmons
http://www.ni-s.u-net.com/