But can someone summarise the causes/issues into something we can all understand? [I don't have time to try to do that for myself.]
I'm attaching some code that may shed some light on this:
The program sets up $string_1 as a byte string and $string_8 as a UTF-8 string.
It uses Devel::Peek to display Perl's internal representations of these,
and of combinations of them. In the output, my comments FOLLOW the section
they refer to.
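Since the attachment itself is not reproduced here, the following is only a minimal sketch of what such a program might look like. It assumes the two strings are built from the literals shown in the dumps, and it uses Dump() from Devel::Peek, which writes to STDERR.

```perl
use strict;
use warnings;
use Devel::Peek;    # Dump() prints the internal SV structure to STDERR

# Assumed reconstruction of the two test strings (not the original script).
my $string_1 = "\xa3";        # code point below 0x100: held as a single byte
my $string_8 = "\x{263a}";    # code point above 0xFF: held internally as UTF-8

Dump($string_1);                # expect no UTF8 flag
Dump($string_8);                # expect the UTF8 flag
Dump($string_1 . $string_8);    # the combination discussed further down
```

The exact addresses and flag names in the output depend on the Perl version and build, so they will not match the dumps in this message byte for byte.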
$string_1 = "\xa3"
SV = PV(0x6864) at 0xdb00
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0xa420 "\243"\0
CUR = 1
LEN = 2
This is what I'd expect to see. 1 byte in = 1 byte out. Notice there's no UTF8 flag.
$string_8 = "\x{263a}"
SV = PV(0x684c) at 0x1ab44
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x26220 "\342\230\272"\0
CUR = 3
LEN = 4
This surprised me. I guess the UTF-8 representation of this character (a smiley face)
is a three-byte encoding. Notice the UTF-8 flag.
$string_1 . $string_8
SV = PV(0x1ece4) at 0x1d770
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0xfef0 "\302\243\342\230\272"\0
CUR = 5
LEN = 6
This is the issue that has caught us all. By combining the two strings (and not
realizing that one was internally UTF-8), the formerly single-byte string is
upgraded to UTF-8 on the fly. In this case \xa3 becomes the two-byte sequence \302\243.
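The upgrade can also be observed without Devel::Peek. The sketch below is only an illustration with assumed variable names: it counts characters with length(), and counts the internal bytes by UTF-8 encoding a copy with utf8::encode().

```perl
use strict;
use warnings;

my $string_1 = "\xa3";        # byte string
my $string_8 = "\x{263a}";    # internally UTF-8
my $joined   = $string_1 . $string_8;

# Character view: still two characters, and the first is still 0xA3.
my $chars = length $joined;
my $first = ord substr($joined, 0, 1);

# Byte view: encode a copy in place to see how many bytes the
# UTF-8 form of the joined string occupies.
my $copy = $joined;
utf8::encode($copy);
my $bytes = length $copy;

printf "chars=%d bytes=%d first=0x%02x flag=%s\n",
    $chars, $bytes, $first, utf8::is_utf8($joined) ? "on" : "off";
```

At the character level nothing has changed, which is why the problem tends to show up only when the string is written out or handed to something that expects raw bytes.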
$string_1 . pack('H*',unpack('H*',$string_8))
SV = PV(0x1edd4) at 0x1d7c4
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK)
PV = 0xff30 "\243\342\230\272"\0
CUR = 4
LEN = 5
This was my solution to explicitly coerce the UTF-8 string back to bytes WITHOUT
CHANGING the byte values. This behavior is more what I *expected* from Perl.
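The pack/unpack round trip relies on unpack reading the string's internal bytes. As far as I know, from Perl 5.10 onward pack and unpack operate on characters by default, so the trick may not behave the same way there. A more explicit alternative, sketched here as an assumption and not as what the attached program does, is encode_utf8() from the Encode module:

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $string_1 = "\xa3";        # byte string
my $string_8 = "\x{263a}";    # internally UTF-8

# encode_utf8() returns the UTF-8 octets as a byte string (UTF8 flag off),
# so concatenating it with another byte string does not trigger an upgrade.
my $octets = $string_1 . encode_utf8($string_8);

printf "length=%d flag=%s\n",
    length $octets, utf8::is_utf8($octets) ? "on" : "off";
```

This gives the same four bytes as the dump above (\243\342\230\272) while saying outright that a character string is being turned into bytes.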
The absolutely INSANE thing in my case was that the values coming back from SOAP::Lite
were UTF-8 internally, but contained nothing but clean, single-byte 7-bit ASCII values.
I think I would have solved the problem much more quickly had there been odd characters
within.
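That situation can be reproduced without SOAP::Lite. In the sketch below, utf8::upgrade() is only a stand-in for whatever set the flag on the returned value, and the variable names are hypothetical:

```perl
use strict;
use warnings;

my $ascii = "plain ASCII";
utf8::upgrade($ascii);        # stand-in: flag on, bytes unchanged for ASCII

my $binary = "\xa3";          # byte string we want to keep as bytes
my $joined = $binary . $ascii;

# The ASCII text looks harmless, yet its flag upgrades the whole result.
my $flag_before = utf8::is_utf8($joined) ? 1 : 0;

# utf8::downgrade() returns true when every character fits in one byte,
# which is always the case for pure ASCII.
my $ok         = utf8::downgrade($ascii);
my $rejoined   = $binary . $ascii;
my $flag_after = utf8::is_utf8($rejoined) ? 1 : 0;

print "before=$flag_before after=$flag_after\n";
```

Checking utf8::is_utf8() on incoming values is one way to spot such a string, since printing it shows nothing unusual.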
--
Tom Mornini, InfoMania Printing and Prepress
ICQ: 113526784, AOL, Yahoo, MSN and Jabber: tmornini
PGP: http://www.mornini.com/tmornini_infomania.asc
utf-8.pl
Description: application/text
