On December 23, 2002 at 22:41, Nick Ing-Simmons wrote:
> >Prints out the following:
> >
> > 1.83
> >
> > ASCII -> UTF8
> > !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
> > abcdefghijklmnopqrstuvwxyz{|}~\x80\x81\x82\x83\x84\x85\x86\x87
> > \x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97
> >
> >After some further hacking, I noticed that the success of the
> >FB_XMLCREF constant is not consistent. I added the following to the
> >script above:
> >
> > my $src = $org;
> > print "\nISO-8859-3 -> ISO-8859-8\n";
> > from_to($src, 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);
> > print $src, "\n";
> >
> >
> > ISO-8859-3 -> ISO-8859-8
> > !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
> > abcdefghijklmnopqrstuvwxyz{|}~
> > Ħ˘£¤\xA5Ĥ§¨
Look more closely. Here is some of the output again:
Ħ˘��\xA5Ĥ��
İŞĞĴ�\xAEŻ�ħ����ĥ��
ışğĵ�\xBEżÀÁÂ\xC3
ÄĊĈÇÈÉÊËÌÍ
ÎÏ\xD0ÑÒÓÔĠÖ�Ĝ
ÙÚÛÜŬŜßàáâ
\xE3äċĉçèéêëì
íîï\xF0ñòóôġö�
ĝùúûüŭŝ˙
Problem characters: \xA5 \xAE \xBE \xC3 \xD0 \xE3 \xF0
> from_to is implemented by translating the 'from' source to Unicode,
> and then Unicode to the 'to' destination.
That is what I figured, since it is impractical to maintain unique
conversion tables between every pair of character encodings.
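
A minimal sketch of that two-step model, assuming from_to() behaves
like decode() followed by encode() with the CHECK argument handed only
to the encode step (which is how the Encode documentation describes
it). The sample string is invented for illustration; byte 0xA1 is
U+0126 in iso-8859-3 and has no iso-8859-8 equivalent:

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to FB_XMLCREF);

# Hypothetical sample: "A" followed by byte 0xA1 (U+0126 in iso-8859-3).
my $octets = "A\xA1";

# Step 1 ('from' side): octets -> Perl's internal Unicode string.
my $unicode = decode('iso-8859-3', $octets);

# Step 2 ('to' side): Unicode -> octets. CHECK is only passed here.
my $two_step = encode('iso-8859-8', $unicode, FB_XMLCREF);

# from_to() performs both steps and converts its first argument in place.
my $in_place = $octets;
from_to($in_place, 'iso-8859-3', 'iso-8859-8', FB_XMLCREF);

print "two-step: $two_step\n";
print "from_to : $in_place\n";
print $two_step eq $in_place ? "same result\n" : "different result\n";
```

If the model holds, both paths should give the same octets, with the
unmappable character replaced by a character reference on the 'to'
side.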
> The FB_XMLCREF happens on the 'to' side. Your original code suffers
> from fallbacks occurring on the 'from' side. 0x80..0xFF are not ASCII.
>
> So when you use an 8-bit encoding like iso8859-3 you don't see the problem.
See above, where I highlight the problem characters. The
iso-2022-jp examples in my original post also illustrate the
problem.
BTW, in the t/fallbacks.t test case of Encode, 8-bit characters are
used for the ascii test, and entity references are generated for the
8-bit characters.
As I stated in my original post, the problem is that t/fallbacks.t
tests an undocumented (or poorly documented) Encode interface, and
it does not test the well-documented interface.
For example, extending from my code sample in the original post,
if you add the following:
    my $meth = find_encoding('ascii');
    my $src = $org;
    my $dst = $meth->encode($src, FB_XMLCREF);
    print $dst, "\n";
The following is generated:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†
‡ˆ‰Š‹ŒŽ‘’
“”•–—˜™š›œž
Ÿ ¡¢£¤¥¦§¨©ª
«¬­®¯°±²³´µ¶
·¸¹º»¼½¾¿ÀÁÂ
ÃÄÅÆÇÈÉÊËÌÍÎ
ÏÐÑÒÓÔÕÖרÙÚ
ÛÜÝÞßàáâãäåæ
çèéêëìíîïðñò
óôõö÷øùúûüýþ
ÿ
So why doesn't the from_to() usage generate the same results?
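
One way to see the difference is to run both paths side by side. This
is only a sketch: the three-byte sample string is hypothetical, and
what the decode step substitutes for an invalid byte may differ
between Encode versions, so the from_to() result is printed as hex
rather than assumed:

```perl
use strict;
use warnings;
use Encode qw(find_encoding from_to FB_XMLCREF);

# Hypothetical sample: one ASCII byte followed by two high bytes.
my $org = "A\x80\xFF";

# Path 1: encode() only. Perl treats the string as the characters
# U+0041, U+0080, U+00FF. The last two cannot be encoded as ASCII,
# so the fallback fires on the 'to' side, where CHECK applies.
my $chars   = $org;
my $encoded = find_encoding('ascii')->encode($chars, FB_XMLCREF);

# Path 2: from_to(). The same bytes are first decoded as ASCII.
# 0x80 and 0xFF are already invalid at that point, so whatever the
# decoder substitutes is what gets encoded afterwards.
my $converted = $org;
from_to($converted, 'ascii', 'utf8', FB_XMLCREF);

print "encode : $encoded\n";
print "from_to: ",
    join(' ', map { sprintf '%02X', ord } split //, $converted), "\n";
```

In the first path the high characters reach the encoder intact; in the
second they are lost during decoding, before FB_XMLCREF is consulted.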
> The behaviour is (almost) by design - i.e. it happened that way and
> I decided it made a kind of sense. Using ASCII is considered as
> asking for 7-bit ness. If you want one of 8-bit super-sets use the
> one you want (iso8859-1 aka latin1 most likely, but perhaps one
> of the windows ones with smart quotes, m-dash etc.)
IMO, the ASCII case is then wrong. If you want to be "strict" about
the 7-bitness of ASCII, then the "\xHH"s should not show up at all;
they should be '?'s or something else. Since the output contains
"\xHH"s, it seems odd that FB_XMLCREF does not generate "&#HH;"s
instead (see above).
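
For reference, the "\xHH" notation can be requested explicitly on the
decode side. A small sketch, assuming the FB_PERLQQ behaviour
described in the Encode documentation (an undecodable octet is
replaced by a literal backslash-x escape); the sample bytes are
hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_PERLQQ FB_XMLCREF);

# Hypothetical sample: one ASCII byte followed by two high bytes.
my $octets = "A\x80\xFF";

# Decode as ASCII and ask for perlqq-style markers on bad octets.
my $input  = $octets;
my $marked = decode('ascii', $input, FB_PERLQQ);

# The markers consist of plain ASCII characters, so a later encode
# step sees nothing it needs to replace, whatever CHECK it is given.
my $copy    = $marked;
my $encoded = encode('ascii', $copy, FB_XMLCREF);

print "after decode: $marked\n";
print "after encode: $encoded\n";
```

This suggests the "\xHH" text is a decode-side marker that the encode
step passes through untouched.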
Maybe I am misunderstanding Encode's conversion operations, so
maybe it is a problem with the documentation not being clear about
this behavior. But IMHO, what I am getting appears to be incorrect.
--ewh