I hope I'm not muddying the waters but dbd::mysql UTF support for
returned
data (not metadata) seems to be nearly there. I need UTF8 support on
at least inserted and returned results-sets although I'm less bothered
by UTF8 table/column names. It would seem that if you define your
table as
using UTF8 then insertion is not a problem but retrieval is. The
following code nearly works - it is just the setting of the utf8
flag on
the returned data that is wrong:
#!/usr/bin/perl -w
use strict;
use DBI qw(:utils);
use charnames ':full';
use Encode;
print "Is utf8::is_utf8 defined: ", defined &utf8::is_utf8, "\n";
print "Is utf8::valid defined: ", defined &utf8::valid, "\n";
my $str = "\x{263a}xxx" . chr(0x05d0) . "\N{ARABIC LETTER ALEF}"; #
smiley
print join(" ", unpack("H*", $str)), "\n";
print "length(str) = ", length($str), "\n";
print "bytes::length(str) = ", bytes::length($str), "\n";
print "utf8::is_utf8 = ", utf8::is_utf8($str) ? 1 : 0, "\n";
print "data_string_desc: ", data_string_desc($str),"\n";
open OUT, ">uni.out";
binmode(OUT, ":utf8");
print OUT "$str\n";
# data written to uni.out is UTF8
my $dbh = DBI->connect("dbi:mysql:test", "xxx", "xxx");
# there are posts on dbi-user as to whether both or either of
# the following should be set
$dbh->do("set character set utf8");
$dbh->do("set names utf8");
$dbh->do("drop table if exists utf");
$dbh->do("create table utf (a char(100)) default charset utf8");
my $sth = $dbh->prepare("insert into utf values (?)");
$sth->execute($str);
$sth = $dbh->prepare("select * from utf");
$sth->execute;
my @row = $sth->fetchrow_array;
print "data_string_desc (after fetch): ",
data_string_desc($row[0]),"\n";
# the following shows we'e got the right data back
# but perl does not know it is utf8
print join(" ", unpack("H*", $row[0])), "\n";
# turning on utf8 causes the rignt uf8 sequence to be output
# and hence sv_utf8_upgrade(sv) will probably work
Encode::_utf8_on($row[0]);
print "data_string_desc (after fetch): ",
data_string_desc($row[0]),"\n";
open OUT, ">utf.out";
binmode (OUT, ":utf8");
print OUT $row[0];
close OUT;
# data written to utf.out is not UTF8 unless is marked utf8
produces:
Is utf8::is_utf8 defined: 1
Is utf8::valid defined: 1
e298ba787878d790d8a7
length(str) = 6
bytes::length(str) = 10
utf8::is_utf8 = 1
data_string_desc: UTF8 on, non-ASCII, 6 characters 10 bytes
data_string_desc (after fetch): UTF8 off, non-ASCII, 10 characters
10 bytes
e298ba787878d790d8a7
with utf.out containing:
<C3><A2><C2><98><C2><BA>xxx<C3><97><C2><90><C3><98><C2><A7>
without that Encode::_utf8_on($row[0]);
Michael Kröll (apr-10-2006)
posted a change for DBD::db2 which seemed to sort this out for DB2
so a similar change could be added to mysql.
Hope this helps.
Martin
--
Martin J. Evans
Easysoft Ltd, UK
http://www.easysoft.com
On 24-Apr-2006 Tim Bunce wrote:
[I'm at the mysql conference and Patrick asked me about adding utf8
support to DBD::mysql. I said I'd look at the libmysql docs and
give my
thoughts. I'm posting to dbi-dev since it may be of interest to
others
interested in enhancing DBD::mysql and to other driver developers.
These are just random thoughts from a quick look at the docs.]
The keys mysql docs seem to be
http://dev.mysql.com/doc/refman/4.1/en/charset-connection.html
The mysql api and client->server protocol doesn't support passing
characterset info to the server on a per-statement / per-bind
value basis.
(http://dev.mysql.com/doc/refman/4.1/en/c-api-prepared-statement-datatypes.ht
m
l)
So the sane way to send utf8 to the server is by setting the
'connection
character set' to utf8 and then only sending utf8 (or its ASCII
subset)
to the server on that connection.
*** Fetching data:
MySQL 4.1.0 added "unsigned int charsetnr" to the MYSQL_FIELD
structure.
It's the "character set number for the field".
So set the UTF8 flag based on that value. Something like:
(field->charsetnr = ???) ? SvUTF8_on(sv) : SvUTF8_off(sv);
I couldn't see any docs for the values of the charsetnr field.
Also, would be good to enable perl code to access the charsetnr
values:
$sth->{mysql_charsetnr}->[$i]
*** Fetching Metadata:
The above is a minimum. It doesn't address metadata like field names
($sth->{NAME}) that might also be in utf8. For that the driver
needs to
know if the 'connection character set' is currently utf8.
(The docs mention mysql->charset but it's not clear if that's part of
the public API.)
However it's detected, the code needs to end up doing:
(...connection charset is utf8...) ? SvUTF8_on(sv) :
SvUTF8_off(sv);
on the metadata.
*** SET NAMES '...'
Intercept SET NAMES and call the mysql_set_character_set() API
instead.
See
http://dev.mysql.com/doc/refman/4.1/en/mysql-set-character-set.html
*** Detecting Inconsistencies
If the connection character set is _not_ utf8 but the application
calls
the driver with data (or SQL statement) that has the UTF8 flag
set, then
it could issue a warning. In practice that may be to be too noisy for
people that done their own workarounds for utf8 support. If so then
they could be changes to level 1 trace messages.
If the connection character set _is_ utf8, and the application calls
the driver with data (or SQL statement) that does _not_ have the UTF8
flag set but _does_ have bytes with the high bit set, then the driver
should issue a warning. The checking for high bit set is an extra
cost
so this should only be enabled if tracing and/or an attribute is set
(perhaps called $dbh->{mysql_charset_checks} = 1)
Tim.