On 08/11/2011 13:16, Tim Bunce wrote:
> On Mon, Nov 07, 2011 at 01:37:38PM +0000, Martin J. Evans wrote:
>> I didn't think I was going to make LPW but it seems I will now - although
>> it has cost me big time leaving it until the last minute.
> All your beers at LPW are on me!
>
>> http://www.martin-evans.me.uk/node/121
> Great work Martin. Many thanks.
>
> I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support"
but doesn't define what "Unicode support" actually means. Clearly the
"Unicode support" of Oracle will be different to that of a CSV file.
So it seems that we need to be really clear about what we want.

I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
     Specifically separating those that understand character encodings
     at one or more of column, table, schema, database level.
     Answer the questions:
         what "Unicode support" is this database capable of? [vague]
         are particular column data types or attributes needed?
         does the db have a session/connection encoding concept?
         does the db support binary data types.
         does the client api identify data encoding?
     A table summarizing this kind of info would be of great value.
     I think this is the most important kind of data we need to move
     forward with this topic.  I suspect we'll end up with a few clear
     levels of "unicode support" by databases that we can then focus on
     more clearly.
This will take some time but I'll work on it. It could be a nightmare for DBD::ODBC as it supports dozens of databases, so I'll have to be a bit cagey about questions like "what Unicode support is this database capable of?".

Contributions welcome.

> 2. Try to make a data-driven common test script.
>      It should fetch the length of the stored value, something like:
>          CREATE TABLE t (c VARCHAR(10));
>          INSERT INTO t VALUES (?)   <= $sth->execute("\x{263A}") # smiley
>          SELECT LENGTH(c), c FROM t
>      Fetching the LENGTH is important because it tells us if the DB is
>      treating the value as Unicode.  The description of DBD::Unify, for
>      example, doesn't clarify if the db itself regards the stored value
>      as unicode or as the underlying string of encoded bytes.

Changed to add a length(c) test - sorry, I misread that the first time as the Perl length() function rather than SQL - I'm not well enough to think that clearly right now.

It will take some time to run through the DBDs to check whether it is length(), len() or something else - isn't SQL great? e.g., DBD::CSV (SQL::Statement) uses char_length while most other SQL dialects use len or length.

BTW, DBD::mysql fails the length test using char_length (documented as "Return number of characters in argument") - it returns 3 for the smiley, not 1; I've not investigated why yet.
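For anyone who wants to follow along before the script lands in subversion, the test is roughly this shape (a minimal sketch only: the %length_func map, table name and connection details are mine, and the per-driver function names still need verifying):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use Test::More;

    # Character-length function per driver - guesses/observations so far.
    my %length_func = (
        CSV    => 'char_length',    # SQL::Statement
        mysql  => 'char_length',
        Oracle => 'length',
        ODBC   => 'len',            # e.g. MS SQL Server
    );

    my $dbh = DBI->connect($ENV{DBI_DSN}, $ENV{DBI_USER}, $ENV{DBI_PASS},
                           { RaiseError => 1 });
    my $func = $length_func{ $dbh->{Driver}{Name} } || 'length';

    $dbh->do('CREATE TABLE t (c VARCHAR(10))');
    $dbh->do('INSERT INTO t VALUES (?)', undef, "\x{263A}");   # smiley

    # 1 means the db regards the value as one character; 3 suggests it
    # is counting the three UTF-8 encoded bytes instead.
    my ($len, $c) = $dbh->selectrow_array("SELECT $func(c), c FROM t");
    is($len, 1,        'db sees one character');
    is($c, "\x{263A}", 'smiley survives the round trip');

    $dbh->do('DROP TABLE t');
    done_testing();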

>      Also probably best to avoid latin characters for this, I'd use
>      something that always has a multi-byte encoding, like a smiley face char.
Changed to use the smiley face for the data, but not for the table/column name checks, as most dbs have other rules for table/column names.

> 3. Focus on placeholders initially.
>      We can ponder utf8 in literal SQL later. That's a separate ball of mud.
>      (I'd also ignore unicode table/column/db names. It's a much lower
>      priority and may become clearer when other issues get resolved.)
The test already does that so I've left it in - it is easy to comment out - a one line change.
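To make the placeholder vs literal-SQL distinction concrete (a sketch, reusing the table t from Tim's example above):

    # Placeholder - the value travels via the driver's bind API, which
    # may carry encoding information with it:
    $dbh->do('INSERT INTO t VALUES (?)', undef, "\x{263A}");

    # Literal SQL - the value is embedded in the statement text, so the
    # encoding of the SQL string itself comes into play:
    $dbh->do('INSERT INTO t VALUES (' . $dbh->quote("\x{263A}") . ')');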
> 4. Tests could report local LANG / LC_ALL env var values
>      so when others report their results we'll have that context.
Added, although I stopped short of outputting everything in %ENV as I guessed people would end up having to edit the output to remove stuff - it reports LANG, LC_* and NLS_* right now.
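In other words, something like this (a sketch; the exact pattern in the script may differ):

    # Report the locale-related environment so results carry their context.
    printf "%s=%s\n", $_, $ENV{$_}
        for grep { /^(?:LANG$|LC_|NLS_)/ } sort keys %ENV;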

> Thanks again. I've only given it a quick skim. I'll read it again before LPW.
>
> Meanwhile, it would be great if people could contribute the info for #1.
>
> Tim.
>
> p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
> would make the tests shorter.
>      my $sample_string = "\x{263A}";
>      ...
>      print data_diff($sample_string, $returned_string);
Added data_diff() to tests which fail, i.e.:

    is($x, $y, "test") or diag(data_diff($x, $y));

(data_diff returns its report as a string, so it needs printing via diag() or similar.)

Test::More::is output is horrible for unicode on a non-UTF-8-enabled terminal.

BTW, the smiley character causes me problems with MS SQL Server. For some reason you can do:

    insert into table values(?)    <= bind_param("\{many_unicode_chars}")
    select column from table where column = $h->quote("\{many_unicode_chars}")

but the smiley does not work in the literal select SQL even though the insert works :-( A parameterized select works fine, too.
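To make the failing case concrete, this is roughly what I am comparing (a sketch; the table/column names are made up):

    my $smiley = "\x{263A}";

    # insert via placeholder - works
    $dbh->do('INSERT INTO unitest (c) VALUES (?)', undef, $smiley);

    # select via placeholder - also works
    my ($got) = $dbh->selectrow_array(
        'SELECT c FROM unitest WHERE c = ?', undef, $smiley);

    # select with the value quoted into the literal SQL - this is the
    # one that fails for the smiley even though it works for other chars
    my ($lit) = $dbh->selectrow_array(
        'SELECT c FROM unitest WHERE c = ' . $dbh->quote($smiley));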

I will upload the current script to DBI's subversion tree in the ex dir, but if anyone is going to do anything major with it I'd rather they mail dbi-dev or me first so I can at least check it before making any changes myself (at least until after LPW). Having said that, as the birthday boy tomorrow I can almost guarantee I won't be working on it tomorrow night.

Also, bear in mind I didn't start out writing the test code for inclusion in a public repository, and it is very time consuming to check it across multiple DBDs. The do_connect method contains connect calls that work for me here and will need changing if you want to run it against your data sources - it was just too much of a pain to do it via DBI_DSN/DBI_USER/DBI_PASS whilst I was gathering info (especially on Windows - my Linux home box is bust right now). A sketch of what needs editing is below.
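If you do pick it up, the part to edit looks something like this (a sketch only; the actual method in the script differs, and the DSNs/credentials here are made-up examples):

    sub do_connect {
        my ($driver) = @_;
        # Hard-coded connections that work for me here - edit these for
        # your own data sources.
        my %dsn = (
            mysql  => ['dbi:mysql:database=test', 'me', 'mypassword'],
            Oracle => ['dbi:Oracle:XE',           'me', 'mypassword'],
        );
        return DBI->connect(@{ $dsn{$driver} }, { RaiseError => 1 });
    }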

I hang around the #dbi IRC channel. If someone wants to make this test code work with Postgres I'd be happy to apply any changes - Postgres is one I'd really like to include, but it is proving too time consuming to add another DBD - especially one where I know active work is going on with unicode/encoding support, which would mean working with the latest trunk.

Lastly, if anyone knows where the pub meet-up for LPW is on Friday night could they tell me, as I've forgotten - lots of people owe me drinks so I need to get there in plenty of time :-)

Martin
