Thank you for the answer. I did some experimenting with the Devel::Peek module and I found the cause of my problem. I was using the DBI $DBHANDLE->quote($astring); method to quote (and escape) the strings that I put in the database. Unfortunately this method does not seem to be Unicode safe, and my data got corrupted. It looks like the data gets UTF-8 encoded twice. I wrote a temporary function to escape my data, but I would rather use the DBI method if possible. I have the feeling that this problem can be solved in some way. Maybe someone can explain what is most likely causing it, and whether I can do something to make quote() Unicode safe (without having to modify the DBI module). If it's not possible, let me know too; then I will just keep the temporary function I use now ;-)
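To make "encoded twice" concrete, here is a minimal sketch that uses only Encode and no database. It assumes (I have not verified this inside DBI or the driver) that somewhere the already encoded bytes are treated as Latin-1 characters and then encoded to UTF-8 again. The variable names are mine, and the sample string is the one from the Devel::Peek example quoted below.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two-character sample string (U+99F1 U+99DD).
my $chars = "\x{99f1}\x{99dd}";

# Intended behaviour: encode once, giving 6 bytes of UTF-8.
my $once = encode('UTF-8', $chars);

# Assumed fault: the 6 bytes are taken for 6 Latin-1 characters
# and encoded again. Every byte is >= 0x80, so each becomes 2 bytes.
my $twice = encode('UTF-8', decode('ISO-8859-1', $once));

# Undoing one layer of encoding recovers the single-encoded bytes.
my $repaired = encode('ISO-8859-1', decode('UTF-8', $twice));

print "once:  ", length($once),  " bytes\n";
print "twice: ", length($twice), " bytes\n";
print "repaired matches: ", ($repaired eq $once ? "yes" : "no"), "\n";
```

If the stored value is twice as long as expected for non-ASCII text, that would be consistent with this kind of double encoding, but it does not prove where it happens.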
Oh yeah, one other thing: since Encode::_utf8_on is an internal function, wouldn't it be better to use Encode::decode("utf8", $somevar) instead? As far as I can see, it should do exactly the same, but if I am mistaken, let me know :)

Thank you,
Merijn van den Kroonenberg

----- Original Message -----
From: "SADAHIRO Tomoyuki" <[EMAIL PROTECTED]>
To: "Merijn van den Kroonenberg" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Thursday, August 15, 2002 3:12 PM
Subject: Re: perl, unicode and databases (mysql)

>
> On Tue, 13 Aug 2002 14:09:37 +0200
> "Merijn van den Kroonenberg" <[EMAIL PROTECTED]> wrote:
>
> > Hi all,
> >
> > I have a perl application (perl 5.8.0) which puts utf8 data in a mysql
> > database. This seems to work pretty well, and the retrieving of the data
> > with perl also works. Using something like this:
> >
> > my $sth = $db_handle->prepare("SELECT some query");
> > $sth->execute;
> > my @row=$sth->fetchrow_array;
> > print $row[0]."\n"; #### print before
> > if ($]>5.007){
> >     require Encode;
> >     Encode::_utf8_on($row[0]);}
> > print $row[0]."\n"; #### print after
> > $sth->finish;
> >
> > Encode::_utf8_on gives me back good data. As far as I understood, the
> > _utf8_on method doesn't do any real conversion, but only switches on the
> > UTF-8 flag of a perl string?
> >
> > If you compare the two prints in the above example, then it seems that after the
> > UTF-8 flag is set the string is UTF-8 decoded. This results in the correct
> > string, so it seems the original string is UTF-8 encoded (double encoded,
> > since it already was UTF-8).
> >
> > When I select the same string manually (mysql prompt) or with PHP, then I
> > get back the double encoded string. So it seems to me that the double
> > encoded format is how perl stores it internally (and also in the database)?
> > But this doesn't sound right to me... this would mean that every time a UTF-8
> > flagged string is used it would need to be UTF-8 decoded. That does not sound very
> > efficient to me, so I doubt it's done that way. But meanwhile I have no idea
> > how it's done... and how it's stored in the database.
> >
> > As you might have guessed, I want to access the data I put in the database
> > with PHP, but I get back double UTF-8 encoded data there. The problem could be
> > in a lot of different places, for example my fetching in PHP, storing in perl,
> > and maybe somewhere else where I have some faulty conversion. To check if
> > the data in the database is correct I tried to figure out how perl works
> > with the data.
> >
> > Maybe someone could put me on the right track, because this got me mighty
> > confused ;-)
>
> To look at what Perl's scalar holds,
> use Devel/Peek.pm.
>
> #!perl
> use Devel::Peek;
> use Encode;
>
> our $camel_utf8 = "\351\247\261\351\247\235";
>
> print STDERR "* _utf8_on\n\n";
> Encode::_utf8_on($camel_utf8);
> Dump($camel_utf8);
>
> print STDERR "\n";
>
> print STDERR "* _utf8_off\n\n";
> Encode::_utf8_off($camel_utf8);
> Dump($camel_utf8);
>
> __END__
>
> The output is like this.
> The difference between _on and _off is found in FLAGS.
>
> * _utf8_on
>
> SV = PV(0x1661c60) at 0x166cccc
>   REFCNT = 1
>   FLAGS = (POK,pPOK,UTF8)
>   PV = 0x16db4e0 "\351\247\261\351\247\235"\0 [UTF8 "\x{99f1}\x{99dd}"]
>   CUR = 6
>   LEN = 7
>
> * _utf8_off
>
> SV = PV(0x1661c60) at 0x166cccc
>   REFCNT = 1
>   FLAGS = (POK,pPOK)
>   PV = 0x16db4e0 "\351\247\261\351\247\235"\0
>   CUR = 6
>   LEN = 7
>
>
>
> SADAHIRO Tomoyuki
>
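
On the _utf8_on versus decode question, here is a small sketch of how the two can be compared on the same bytes as in the Devel::Peek example. It is only an illustration with variable names of my own: for valid UTF-8 both end up with the same two-character string, but decode returns a new string and checks the input, while _utf8_on only switches the flag on the existing scalar.

```perl
use strict;
use warnings;
use Encode ();

# The six UTF-8 bytes from the Devel::Peek example.
my $bytes = "\351\247\261\351\247\235";

# Variant 1: switch the flag on a copy; no conversion, no validation.
my $flagged = $bytes;
Encode::_utf8_on($flagged);

# Variant 2: decode returns a new character string; $bytes is untouched.
my $decoded = Encode::decode('utf8', $bytes);

print "flagged length: ", length($flagged), "\n";
print "decoded length: ", length($decoded), "\n";
print "equal: ", ($flagged eq $decoded ? "yes" : "no"), "\n";
print "source still bytes: ", (Encode::is_utf8($bytes) ? "no" : "yes"), "\n";

# With malformed input (a lone continuation byte) decode substitutes
# U+FFFD, whereas _utf8_on would flag the broken bytes unchecked.
my $bad_decoded = Encode::decode('utf8', "a\x80b");
print "bad input decoded to ", length($bad_decoded), " characters\n";
```

So for data that is known to be valid UTF-8 the result looks the same, and decode seems the safer choice when the input might not be.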