----- Original Message -----
From: "Tim Bunce" <[EMAIL PROTECTED]>
To: "Merijn van den Kroonenberg" <[EMAIL PROTECTED]>
Subject: Re: perl, unicode and databases (mysql)

> On Tue, Aug 20, 2002 at 04:50:18PM +0200, Merijn van den Kroonenberg wrote:
> > Thank you for the answer,
> >
> > I did some experimenting with the Devel::Peek module and I found the cause
> > of my problem.
> > I was using the DBI $DBHANDLE->quote($astring); method to quote (and slash)
> > strings that I put in the database. Unfortunately this method is not unicode
> > safe, and my data got corrupted. It looks like the data gets utf encoded
> > twice. I wrote a temporary function to slash my data, but I would rather use
> > the DBI method if possible. I have the feeling that this problem can be
> > solved in some way, maybe someone can explain what is most likely causing
> > the problem, and if I can do something to make it unicode safe (without
> > having to modify the DBI module). If it's not possible let me know too, then
> > I just keep the temp function I use now ;-)
>
> In general the quote() method should be as aware of utf8 as the
> database is. If the database supports utf8 then the quote() method
> should do-the-right-thing or else it's broken and needs fixing.

Well, when I quote it manually:

############################################################
# utf8_quote(string)
sub utf8_quote($){
    my $astring = shift;
    $astring =~ s/(['"\\\0])/\\$1/g;
    return "'".$astring."'";
}# utf8_quote
############################################################

Then I can store and retrieve it just fine. So I guess it supports utf8 ;-)

> > Oh yeah, one other thing, since Encode::_utf8_on is an internal function,
> > wouldn't it be better to use Encode::decode("utf8",$somevar) instead? As far
> > as I can see, it should do exactly the same, but if I am mistaken, let me
> > know :)
>
> Encode::_utf8_on *just* sets the internal utf8 flag bit on the value
> which *must* be already valid utf8 (or else you'll get problems later).
>
> I believe Encode::decode is different (but I've never used either and
> could easily not know what I'm talking about :)

from perldoc Encode

    CAVEAT: When you run "$string = decode("utf8", $octets)", then $string
    may not be equal to $octets. Though they both contain the same data,
    the utf8 flag for $string is on unless $octets entirely consists of
    ASCII data (or EBCDIC on EBCDIC machines). See "The UTF-8 flag" below.

That's why I got that idea, so I wondered, cause it also seems to set the utf8
flag, and leave the data alone. Not sure tho.

> Tim.

Thank you for the swift reply,
Merijn van den Kroonenberg

> > Thank you,
> > Merijn van den Kroonenberg
> >
> >
> > ----- Original Message -----
> > From: "SADAHIRO Tomoyuki" <[EMAIL PROTECTED]>
> > To: "Merijn van den Kroonenberg" <[EMAIL PROTECTED]>
> > Cc: <[EMAIL PROTECTED]>
> > Sent: Thursday, August 15, 2002 3:12 PM
> > Subject: Re: perl, unicode and databases (mysql)
> >
> >
> > > On Tue, 13 Aug 2002 14:09:37 +0200
> > > "Merijn van den Kroonenberg" <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I have a perl application (perl 5.8.0) which puts utf8 data in a mysql
> > > > database. This seems to work pretty well, and the retrieving of the data
> > > > with perl also works. Using something like this:
> > > >
> > > > my $sth = $db_handle->prepare("SELECT some query");
> > > > $sth->execute;
> > > > my @row=$sth->fetchrow_array;
> > > > print $row[0]."\n"; #### print before
> > > > if ($]>5.007){
> > > >     require Encode;
> > > >     Encode::_utf8_on($row[0]);}
> > > > print $row[0]."\n"; #### print after
> > > > $sth->finish;
> > > >
> > > > The Encode utf8_on gives me back good data. As far as I understood the
> > > > _utf8_on method doesn't do any real conversions, but only switches the utf
> > > > flag of a perl string?
> > > >
> > > > If you compare the two prints in above example, then it seems that after the
> > > > utf flag is set the string is utf decoded.
> > > > This results in the correct string, so it seems the original string is
> > > > utf encoded (double encoded, since it already was UTF).
> > > >
> > > > When I select the same string manually (mysql prompt) or with PHP, then I
> > > > get back the double encoded string. So it seems to me that the double
> > > > encoded format is how perl stores it internally (and also in the database)?
> > > > But this doesn't sound right to me...this would mean that every time a utf
> > > > flagged string is used it would need to be utf decoded. That sounds not very
> > > > efficient to me, so I doubt it's done that way. But meanwhile I have no idea
> > > > how it's done...and how it's stored in the database.
> > > >
> > > > As you might have guessed I want to access the data I put in the database
> > > > with PHP, but I get back double utf encoded data there. The problem could be
> > > > in a lot of different places, for example my fetching in PHP, storing in perl
> > > > and maybe somewhere else where I have some faulty conversion. To check if
> > > > the data in the database is correct I tried to figure out how perl works
> > > > with the data.
> > > >
> > > > Maybe someone could put me on the right track, because this got me mighty
> > > > confused ;-)
> > >
> > > To look at what a Perl scalar holds,
> > > use Devel/Peek.pm.
> > >
> > > #!perl
> > > use Devel::Peek;
> > > use Encode;
> > >
> > > our $camel_utf8 = "\351\247\261\351\247\235";
> > >
> > > print STDERR "* _utf8_on\n\n";
> > > Encode::_utf8_on($camel_utf8);
> > > Dump($camel_utf8);
> > >
> > > print STDERR "\n";
> > >
> > > print STDERR "* _utf8_off\n\n";
> > > Encode::_utf8_off($camel_utf8);
> > > Dump($camel_utf8);
> > >
> > > __END__
> > >
> > > The output is like this.
> > > The difference between _on and _off is found in FLAGS.
> > >
> > > * _utf8_on
> > >
> > > SV = PV(0x1661c60) at 0x166cccc
> > >   REFCNT = 1
> > >   FLAGS = (POK,pPOK,UTF8)
> > >   PV = 0x16db4e0 "\351\247\261\351\247\235"\0 [UTF8 "\x{99f1}\x{99dd}"]
> > >   CUR = 6
> > >   LEN = 7
> > >
> > > * _utf8_off
> > >
> > > SV = PV(0x1661c60) at 0x166cccc
> > >   REFCNT = 1
> > >   FLAGS = (POK,pPOK)
> > >   PV = 0x16db4e0 "\351\247\261\351\247\235"\0
> > >   CUR = 6
> > >   LEN = 7
> > >
> > > SADAHIRO Tomoyuki
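
For anyone following the thread, the flag-versus-decode question and the "encoded twice" symptom can be reproduced without a database. The script below is only a sketch under stated assumptions: it reuses the six octets from the Devel::Peek example, and the variable names ($octets, $chars, $flagged, $double, $undone) are made up for illustration. It shows one general way double encoding can arise (encoding a string whose octets are taken for Latin-1 characters); it is not a diagnosis of what DBI's quote() or the MySQL driver actually does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode is_utf8);

# The two characters from the Devel::Peek example, as 6 raw UTF-8 octets.
# The utf8 flag is off, so Perl sees 6 separate one-byte characters.
my $octets = "\351\247\261\351\247\235";

# decode() interprets the octets as UTF-8 and returns a character string
# (2 characters here, with the utf8 flag on).
my $chars = decode("utf8", $octets);

# _utf8_on() only flips the flag on the existing buffer. It does no
# validation, so it is only safe when the buffer already holds valid UTF-8.
my $flagged = $octets;
Encode::_utf8_on($flagged);

# Double encoding: if the 6 octets are (wrongly) treated as 6 characters
# and encoded again, every octet >= 0x80 becomes 2 octets.
my $double = encode("utf8", $octets);

# A single decode undoes only the second encoding and returns the
# original 6 octets (as characters U+00E9, U+00A7, ...), not the 2 characters.
my $undone = decode("utf8", $double);

printf "decode():    %d chars, flag %s\n", length($chars),   is_utf8($chars)   ? "on" : "off";
printf "_utf8_on():  %d chars, flag %s\n", length($flagged), is_utf8($flagged) ? "on" : "off";
printf "double enc.: %d octets\n", length($double);
printf "one decode of the double-encoded data gives back the original octets: %s\n",
    $undone eq $octets ? "yes" : "no";
```

For valid UTF-8 input the two approaches end up with the same string; the practical difference is that decode() is the documented interface and actually interprets the octets, while _utf8_on() trusts the caller completely.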