Earl Hood <[EMAIL PROTECTED]> writes:
>> >   ISO-8859-3 -> ISO-8859-8
>> >    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
>> >   abcdefghijklmnopqrstuvwxyz{|}~
>> >   Ħ˘£¤\xA5Ĥ§¨
>
>Look more closely, I will include some of the output again:
>
>  Ħ˘£¤\xA5Ĥ§¨
>  İŞĞĴ\xAEŻ°ħ²³´µĥ·¸
>  ışğĵ½\xBEżÀÁÂ\xC3
>  ÄĊĈÇÈÉÊËÌÍ
>  ÎÏ\xD0ÑÒÓÔĠÖªĜ
>  ÙÚÛÜŬŜßàáâ
>  \xE3äċĉçèéêëì
>  íîï\xF0ñòóôġöº
>  ĝùúûüŭŝ˙
>
>Problem characters: \xA5 \xAE \xBE \xC3 \xD0 \xE3 \xF0
>
>
>> The FB_XMLCREF happens on the 'to' side. Your original code suffers
>> from fallbacks occurring on the 'from' side. 0x80..0xFF are not ASCII.
>>
>> So when you use an 8-bit encoding like iso8859-3 you don't see the problem.
>
>See above where I highlight the problem characters.

So I was too glib. You see your "problem" when the octet is not defined
in the source character set, e.g. 0xA5 is not given a meaning by iso-8859-3.

>Also, the iso-2022-jp examples provided in my original post
>illustrated the problem.

Possibly. iso-2022-jp is an escape encoding and has a whole slew of
other things to worry about.

>
>BTW, in the t/fallbacks.t test case of Encode, 8-bit characters are
>used for the ascii test, and entity references are generated for the
>8-bit characters.
>
>As I stated in my original post, the problem is that t/fallbacks.t
>tests an undocumented (or poorly documented) Encode interface, and
>it does not test the well-documented interface.

Whether un(der)?documented or not, the object style used in t/fallback.t
is the way the internals work.

You say "... it is impractical to maintain unique conversion tables between
all types of character encodings." - it is even more impractical
to _test_ them that way.

...

>
>For example, extending from my code sample in the original post,
>if you add the following:
>
>  my $meth = find_encoding('ascii');
>  my $src = $org;
>  my $dst = $meth->encode($src, FB_XMLCREF);
>  print $dst, "\n";
>
>The following is generated:
>
>   !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`
>  abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†
>  ‡ˆ‰Š‹ŒŽ‘’
>  “”•–—˜™š›œž
>  Ÿ ¡¢£¤¥¦§¨©ª
>  «¬­®¯°±²³´µ¶
>  ·¸¹º»¼½¾¿ÀÁ
>  ÃÄÅÆÇÈÉÊËÌÍÎ
>  ÏÐÑÒÓÔÕÖ×ØÙÚ
>  ÛÜÝÞßàáâãäåæ
>  çèéêëìíîïðñò
>  óôõö÷øùúûüýþ
>  ÿ
>
>So why doesn't the from_to() usage generate the same results?

Because the ->decode side has removed the non-representable octets
and replaced them with 4 chars each: \xHH. So there are no hi-bit chars
left to cause entity refs.

>
>IMO, the ASCII case is then wrong.  If you want to be "strict" about
>the 7-bitness of ascii, then the "\xHH"s should not show up at all, but
>be '?'s, or something else.

You can get that (I believe) by passing appropriate fallback options
to ->decode of ASCII. I personally dislike fallback to '?'
as it loses information in a way that is hard to back-track - which is why
the default fallback is \xHH.

>Since the output is "\xHH"s, it seems
>odd that FB_XMLCREF does not generate "&#HH;"s instead (see
>above).

XMLCREFs are Unicode - ASCII 0xA0 (e.g.) is NOT Unicode, it is undefined.

>
>Maybe I am misunderstanding Encode's conversion operations, so
>maybe it is a problem with the documentation not being clear about
>this behavior.  But IMHO, what I am getting appears to be incorrect.

And IMHO you are getting what I "designed" it to produce ;-)

I strongly recommend doing conversions in two steps explicitly - that way
you can get whatever you want.

I am also willing to concede that documentation could be improved :-)

>
>--ewh
-- 
Nick Ing-Simmons
http://www.ni-s.u-net.com/
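
For illustration, here is a minimal sketch of the explicit two-step conversion recommended above. It is not code from the thread. The helper name two_step_convert() and the three-octet sample input are made up for the example. It assumes a current Encode, where FB_PERLQQ on the decode side keeps an undecodable octet as literal \xHH text, and FB_XMLCREF on the encode side emits a character reference for anything the target encoding cannot represent.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_PERLQQ FB_XMLCREF);

# two_step_convert() is a hypothetical helper name, not part of Encode.
sub two_step_convert {
    my ($octets, $from, $to) = @_;

    # Step 1 ('from' side): octets -> Perl characters.
    # FB_PERLQQ asks decode to keep octets that have no meaning in
    # the source encoding as literal \xHH text.
    my $chars = decode($from, $octets, FB_PERLQQ);

    # Step 2 ('to' side): characters -> octets.
    # FB_XMLCREF asks encode to emit a character reference for any
    # character the target encoding cannot represent.
    return encode($to, $chars, FB_XMLCREF);
}

# Assumed sample input: 'A', then 0xA1 (defined in iso-8859-3),
# then 0xA5 (not given a meaning by iso-8859-3).
my $ascii = two_step_convert("A\xA1\xA5", 'iso-8859-3', 'ascii');
print $ascii, "\n";
```

With the steps separated, the fallback for each side is chosen independently: pass a different CHECK value to decode if \xHH is not what you want for undefined source octets, and a different one to encode if you do not want character references.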