On Thursday, June 19, 2003, at 02:00 AM, Tim Bunce wrote:

But can someone summarise the causes/issues into something we can
all understand? [I don't have time to try to do that for myself.]

I'm attaching some code that may shed some light on this:

The program sets up $string_1 as a byte string, $string_8 as a UTF-8 string.

It uses Devel::Peek to display Perl's internal representation of each string,
and of combinations of the two. In the output below, each of my comments FOLLOWS
the section it refers to.
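The attachment is not reproduced in this message. A minimal sketch of what such a program might look like is below; it is a reconstruction from the dumps that follow, not the original utf-8.pl, and the names $combined and $case are assumptions made for illustration.

```perl
use strict;
use warnings;
use Devel::Peek;

my $string_1 = "\xa3";        # code point below 0x100: stored as one byte, no UTF8 flag
my $string_8 = "\x{263a}";    # code point above 0xFF: stored as UTF-8, UTF8 flag set
my $combined = $string_1 . $string_8;   # concatenating the two yields a UTF8-flagged string

# Dump() writes to STDERR, so the labels go there as well.
for my $case (
    [ '$string_1',             $string_1 ],
    [ '$string_8',             $string_8 ],
    [ '$string_1 . $string_8', $combined ],
) {
    my ($label, $value) = @$case;
    print STDERR "$label\n";
    Dump($value);
}
```

Addresses and some flag names in the output will likely differ between Perl versions; the PV, CUR and UTF8 flag are the parts to compare.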

$string_1 = "\xa3"
SV = PV(0x6864) at 0xdb00
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0xa420 "\243"\0
  CUR = 1
  LEN = 2

This is what I'd expect to see. 1 byte in = 1 byte out. Notice there's no UTF8 flag.

$string_8 = "\x{263a}"
SV = PV(0x684c) at 0x1ab44
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x26220 "\342\230\272"\0
  CUR = 3
  LEN = 4

This surprised me. I guess the UTF-8 representation of this character (a smiley face,
U+263A) is a three-byte encoding. Notice the UTF8 flag.

$string_1 . $string_8
SV = PV(0x1ece4) at 0x1d770
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0xfef0 "\302\243\342\230\272"\0
  CUR = 5
  LEN = 6

This is the issue that has caught us all. By combining the two strings (and not
realizing that one was internally UTF-8) the formerly single-byte string is
upgraded to UTF-8 on the fly. In this case \243 becomes the two-byte sequence
\302\243, so the bytes in the buffer are no longer the bytes I put in.
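The upgrade can also be observed without Devel::Peek. The sketch below uses only the built-in utf8:: functions; the variable names $joined and $octets are illustrative. It suggests that the character value of \xa3 survives the concatenation and that only the internal storage (and therefore whatever is written out as raw bytes) changes.

```perl
use strict;
use warnings;

my $string_1 = "\xa3";
my $string_8 = "\x{263a}";
my $joined   = $string_1 . $string_8;

# The result carries the UTF8 flag, but $string_1 itself is left untouched.
printf "joined flagged:   %s\n", utf8::is_utf8($joined)   ? "yes" : "no";
printf "string_1 flagged: %s\n", utf8::is_utf8($string_1) ? "yes" : "no";

# At the character level nothing changed: the first character is still 0xa3.
printf "chars: %d, first char: 0x%02x\n", length($joined), ord($joined);

# Encoding a copy to UTF-8 octets exposes the internal byte sequence.
my $octets = $joined;
utf8::encode($octets);
printf "octets: %s\n", join ' ', map { sprintf '%02x', ord } split //, $octets;
```

The last line should list five octets, matching CUR = 5 in the dump above.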

$string_1 . pack('H*',unpack('H*',$string_8))
SV = PV(0x1edd4) at 0x1d7c4
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0xff30 "\243\342\230\272"\0
  CUR = 4
  LEN = 5

This was my solution to explicitly coerce the UTF-8 string back to bytes WITHOUT
CHANGING the byte values. This behavior is more what I *expected* from Perl.
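An alternative to the pack/unpack round trip is to encode the character string explicitly before concatenating. The sketch below uses encode_utf8 from the Encode module; treat it as one possible approach rather than the recommended one. As far as I know, pack and unpack switched to character semantics in Perl 5.10, so the pack('H*', unpack('H*', ...)) trick may behave differently on newer perls than it did when this was written.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

my $string_1 = "\xa3";
my $string_8 = "\x{263a}";

# encode_utf8 returns the UTF-8 octets as a plain byte string (UTF8 flag off).
my $octets = encode_utf8($string_8);

# Both operands are byte strings now, so nothing gets upgraded.
my $joined = $string_1 . $octets;

printf "UTF8 flag: %s\n", utf8::is_utf8($joined) ? "on" : "off";
printf "bytes: %s\n", join ' ', map { sprintf '%02x', ord } split //, $joined;
```

This should leave \xa3 as a single byte followed by the three octets of the smiley, the same layout as the dump above.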

The absolutely INSANE thing in my case was that the values coming back from SOAP::Lite
were UTF-8 internally, but had nothing but single byte clean 7 bit ASCII values inside.
I think I would have solved the problem much more quickly had there been odd characters
within.
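The ASCII-only case can be reproduced in isolation. In the sketch below, $from_soap is a hypothetical stand-in for a value returned by SOAP::Lite (no SOAP call is made), and utf8::upgrade is used to set the flag artificially. It illustrates why the flagged string looks harmless on its own yet still changes the bytes of anything joined to it, and how utf8::downgrade can clear the flag when every character fits in one byte.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a SOAP::Lite return value: pure ASCII, UTF8 flag on.
my $from_soap = "plain ascii";
utf8::upgrade($from_soap);

my $binary = "\xa3";

# Joining with the flagged string upgrades the binary data in the result.
my $joined = $binary . $from_soap;
my $octets = $joined;
utf8::encode($octets);                   # expose the internal byte sequence
printf "before downgrade: %d chars, %d octets\n", length($joined), length($octets);

# All characters are below 0x100, so the downgrade succeeds and clears the flag.
utf8::downgrade($from_soap);
my $safe = $binary . $from_soap;
printf "after downgrade: flag %s, %d bytes\n",
    utf8::is_utf8($safe) ? "on" : "off", length($safe);
```

utf8::downgrade dies if the string contains a character above 0xFF (unless a true second argument is passed), so this only applies to data known to be single-byte.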

--
-- Tom Mornini, InfoMania Printing and Prepress
--
-- ICQ: 113526784, AOL, Yahoo, MSN and Jabber: tmornini
-- PGP: http://www.mornini.com/tmornini_infomania.asc


Attachment: utf-8.pl
Description: application/applefile

Attachment: utf-8.pl
Description: application/text

