Re: perl, unicode and databases (mysql)

Merijn van den Kroonenberg Tue, 20 Aug 2002 09:07:58 -0700


----- Original Message -----
From: "Tim Bunce" <[EMAIL PROTECTED]>
To: "Merijn van den Kroonenberg" <[EMAIL PROTECTED]>
Subject: Re: perl, unicode and databases (mysql)



> On Tue, Aug 20, 2002 at 04:50:18PM +0200, Merijn van den Kroonenberg wrote:
> > Thank you for the answer,
> >
> > I did some experimenting with the Devel::Peek module and I found the
> > cause of my problem.
> > I was using the DBI $DBHANDLE->quote($astring); method to quote (and
> > slash) strings that I put in the database. Unfortunately this method is
> > not Unicode safe, and my data got corrupted. It looks like the data gets
> > UTF-8 encoded twice. I wrote a temporary function to slash my data, but I
> > would rather use the DBI method if possible. I have the feeling that this
> > problem can be solved in some way. Maybe someone can explain what is most
> > likely causing the problem, and whether I can do something to make it
> > Unicode safe (without having to modify the DBI module). If it's not
> > possible, let me know too; then I will just keep the temp function I use
> > now ;-)
>
> In general the quote() method should be as aware of utf8 as the
> database is.  If the database supports utf8 then the quote() method
> should do-the-right-thing or else it's broken and needs fixing.

Well, when I quote it manually:

############################################################
# utf8_quote(string)
# Backslash-escape single quotes, double quotes, backslashes and
# NUL bytes, then wrap the result in single quotes.
sub utf8_quote($){
  my $astring = shift;
  $astring =~ s/(['"\\\0])/\\$1/g;
  return "'".$astring."'";
}# utf8_quote
############################################################

Then I can store and retrieve it just fine. So I guess it supports utf8 ;-)
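
To make the "encoded twice" symptom concrete, here is a minimal sketch of what double encoding does to a string. It does not show what quote() does internally (I have not traced that); it only assumes that somewhere along the way already-encoded octets get treated as Latin-1 characters and encoded again. The variable names are just for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Two characters (U+99F1 U+99DD), the same ones used in the
# Devel::Peek example quoted further down in this thread.
my $chars = "\x{99f1}\x{99dd}";

# Correct: encode once. 2 characters become 6 octets.
my $octets = encode("UTF-8", $chars);

# Mistake: the octets are taken for Latin-1 characters and encoded
# again. Every octet here is >= 0x80, so each one becomes 2 octets.
my $double = encode("UTF-8", $octets);

printf "characters:    %d\n", length $chars;    # 2
printf "encoded once:  %d octets\n", length $octets;   # 6
printf "encoded twice: %d octets\n", length $double;   # 12

# One decode removes exactly one layer of encoding.
my $undone = decode("UTF-8", $double);
print "one decode gives back the single-encoded octets\n"
    if $undone eq $octets;
```

If the data in the database is twice as long as expected for non-ASCII characters, this kind of second encoding step is a likely suspect.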

>
> > Oh yeah, one other thing: since Encode::_utf8_on is an internal function,
> > wouldn't it be better to use Encode::decode("utf8",$somevar) instead? As
> > far as I can see, it should do exactly the same, but if I am mistaken, let
> > me know :)
>
> Encode::_utf8_on *just* sets the internal utf8 flag bit on the value
> which *must* be already valid utf8 (or else you'll get problems later).
>
> I believe Encode::decode is different (but I've never used either and
> could easily not know what I'm talking about :)

From perldoc Encode:
 CAVEAT: When you run "$string = decode("utf8",
         $octets)", then $string may not be equal to $octets.
         Though they both contain the same data, the utf8 flag
         for $string is on unless $octets entirely consists of
         ASCII data (or EBCDIC on EBCDIC machines).  See "The
         UTF-8 flag" below.

That's why I got that idea: decode also seems to set the utf8 flag and
leave the data alone. Not sure, though.
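
A small sketch to compare the two, using only Encode and the built-in utf8::valid. As far as I can tell, for valid input they give the same result, but decode checks the octets and _utf8_on does not. The malformed test string is made up for the example.

```perl
use strict;
use warnings;
use Encode ();

# Valid UTF-8 octets for U+99F1 U+99DD.
my $good = "\351\247\261\351\247\235";

my $decoded = Encode::decode("UTF-8", $good);  # validates, returns a new string
my $flagged = $good;
Encode::_utf8_on($flagged);                    # only flips the flag, in place

print "same result for valid input\n" if $decoded eq $flagged;
printf "length: %d characters\n", length $decoded;   # 2

# Malformed input: a lone 0xE9 lead byte followed by ASCII.
my $bad = "\351abc";

# With the default CHECK, decode substitutes U+FFFD for the bad byte.
my $bad_decoded = Encode::decode("UTF-8", $bad);

# _utf8_on does no checking, so the scalar is now flagged as UTF-8
# but its buffer is not well-formed.
my $bad_flagged = $bad;
Encode::_utf8_on($bad_flagged);

print "decode returned a well-formed string\n"
    if utf8::valid($bad_decoded);
print "_utf8_on left a malformed string behind\n"
    unless utf8::valid($bad_flagged);
```

So decode looks like the safer choice whenever the octets come from outside (a database, a socket), where they are not guaranteed to be valid.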


>
> Tim.

Thank you for the swift reply,

Merijn van den Kroonenberg

>
> > Thank you,
> > Merijn van den Kroonenberg
> >
> >
> > ----- Original Message -----
> > From: "SADAHIRO Tomoyuki" <[EMAIL PROTECTED]>
> > To: "Merijn van den Kroonenberg" <[EMAIL PROTECTED]>
> > Cc: <[EMAIL PROTECTED]>
> > Sent: Thursday, August 15, 2002 3:12 PM
> > Subject: Re: perl, unicode and databases (mysql)
> >
> >
> > >
> > > On Tue, 13 Aug 2002 14:09:37 +0200
> > > "Merijn van den Kroonenberg" <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I have a perl application (perl 5.8.0) which puts utf8 data in a
> > > > mysql database. This seems to work pretty well, and retrieving the
> > > > data with perl also works, using something like this:
> > > >
> > > > my $sth = $db_handle->prepare("SELECT some query");
> > > > $sth->execute;
> > > > my @row=$sth->fetchrow_array;
> > > > print $row[0]."\n"; #### print before
> > > > if ($]>5.007){
> > > >   require Encode;
> > > >   Encode::_utf8_on($row[0]);}
> > > > print $row[0]."\n"; #### print after
> > > > $sth->finish;
> > > >
> > > > The Encode _utf8_on call gives me back good data. As far as I
> > > > understand, the _utf8_on method doesn't do any real conversion, but
> > > > only switches the utf8 flag of a perl string?
> > > >
> > > > If you compare the two prints in the above example, it seems that
> > > > after the utf8 flag is set the string is UTF-8 decoded. This results
> > > > in the correct string, so it seems the original string is UTF-8
> > > > encoded (double encoded, since it already was UTF-8).
> > > >
> > > > When I select the same string manually (mysql prompt) or with PHP,
> > > > I get back the double encoded string. So it seems to me that the
> > > > double encoded format is how perl stores it internally (and also in
> > > > the database)? But this doesn't sound right to me... this would mean
> > > > that every time a utf8-flagged string is used it would need to be
> > > > UTF-8 decoded. That doesn't sound very efficient to me, so I doubt
> > > > it's done that way. But meanwhile I have no idea how it's done... and
> > > > how it's stored in the database.
> > > >
> > > > As you might have guessed, I want to access the data I put in the
> > > > database with PHP, but I get back double UTF-8 encoded data there.
> > > > The problem could be in a lot of different places, for example my
> > > > fetching in PHP, storing in perl, or somewhere else where I have
> > > > some faulty conversion. To check whether the data in the database is
> > > > correct, I tried to figure out how perl works with the data.
> > > >
> > > > Maybe someone could put me on the right track, because this got me
> > > > mighty confused ;-)
> > >
> > > To see what a Perl scalar holds,
> > > use Devel::Peek (Devel/Peek.pm).
> > >
> > > #!perl
> > > use Devel::Peek;
> > > use Encode;
> > >
> > > our $camel_utf8 = "\351\247\261\351\247\235";
> > >
> > > print STDERR "* _utf8_on\n\n";
> > > Encode::_utf8_on($camel_utf8);
> > > Dump($camel_utf8);
> > >
> > > print STDERR "\n";
> > >
> > > print STDERR "* _utf8_off\n\n";
> > > Encode::_utf8_off($camel_utf8);
> > > Dump($camel_utf8);
> > >
> > > __END__
> > >
> > > The output is like this.
> > > The difference between _on and _off is found in FLAGS.
> > >
> > > * _utf8_on
> > >
> > > SV = PV(0x1661c60) at 0x166cccc
> > >   REFCNT = 1
> > >   FLAGS = (POK,pPOK,UTF8)
> > >   PV = 0x16db4e0 "\351\247\261\351\247\235"\0 [UTF8 "\x{99f1}\x{99dd}"]
> > >   CUR = 6
> > >   LEN = 7
> > >
> > > * _utf8_off
> > >
> > > SV = PV(0x1661c60) at 0x166cccc
> > >   REFCNT = 1
> > >   FLAGS = (POK,pPOK)
> > >   PV = 0x16db4e0 "\351\247\261\351\247\235"\0
> > >   CUR = 6
> > >   LEN = 7
> > >
> > >
> > >
> > > SADAHIRO Tomoyuki
> > >
> >
> >
>
