On 08/11/2011 13:16, Tim Bunce wrote:
> On Mon, Nov 07, 2011 at 01:37:38PM +0000, Martin J. Evans wrote:
>> I didn't think I was going to make LPW but it seems I will now - although
>> it has cost me big time leaving it until the last minute.
> All your beers at LPW are on me!
>
>> http://www.martin-evans.me.uk/node/121
> Great work Martin. Many thanks.
>
> I've some comments and suggestions for you...

It says "There is no single way across DBDs to enable Unicode support"
but doesn't define what "Unicode support" actually means. Clearly the
"Unicode support" of Oracle will be different to that of a CSV file.
So it seems that we need to be really clear about what we want.

I'd suggest...

1. Focus initially on categorising the capabilities of the databases.
     Specifically separating those that understand character encodings
     at one or more of column, table, schema, database level.
     Answer the questions:
         what "Unicode support" is this database capable of? [vague]
         are particular column data types or attributes needed?
         does the db have a session/connection encoding concept?
         does the db support binary data types.
         does the client api identify data encoding?
     A table summarizing this kind of info would be of great value.
     I think this is the most important kind of data we need to move
     forward with this topic.  I suspect we'll end up with a few clear
     levels of "unicode support" by databases that we can then focus on
     more clearly.
This will take some time but I'll work on it. It could be a nightmare for DBD::ODBC as it supports dozens of databases, so I'll have to be a bit cagey about questions like "what Unicode support is this database capable of?".

Contributions welcome.

> 2. Try to make a data-driven common test script.
>      It should fetch the length of the stored value, something like:
>          CREATE TABLE t (c VARCHAR(10));
>          INSERT INTO t VALUES (?)   <= $sth->execute("\x{263A}") # smiley
>          SELECT LENGTH(c), c FROM t
>      Fetching the LENGTH is important because it tells us if the DB is
>      treating the value as Unicode.  The description of DBD::Unify, for
>      example, doesn't clarify if the db itself regards the stored value
>      as unicode or as the underlying string of encoded bytes.

Changed to add a length(c) test - sorry, I misread that the first time as the Perl length() function rather than SQL - I'm not well enough to think that clearly right now.

It will take some time to run through the DBDs to check whether it is length(), len() or something else - isn't SQL great? e.g., DBD::CSV (SQL::Statement) uses char_length while most other SQL dialects use len or length.

BTW, DBD::mysql fails the length test using char_length (documented as "Return number of characters in argument") - it returns 3 for the smiley, not 1; I've not investigated why yet.
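For anyone who wants to follow along before the script lands in subversion, the test is roughly this shape (a minimal sketch only: the %length_func map, table name and connection details are mine, and the per-driver function names still need verifying):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;
    use Test::More;

    # Character-length function per driver - guesses/observations so far.
    my %length_func = (
        CSV    => 'char_length',    # SQL::Statement
        mysql  => 'char_length',
        Oracle => 'length',
        ODBC   => 'len',            # e.g. MS SQL Server
    );

    my $dbh = DBI->connect($ENV{DBI_DSN}, $ENV{DBI_USER}, $ENV{DBI_PASS},
                           { RaiseError => 1 });
    my $func = $length_func{ $dbh->{Driver}{Name} } || 'length';

    $dbh->do('CREATE TABLE t (c VARCHAR(10))');
    $dbh->do('INSERT INTO t VALUES (?)', undef, "\x{263A}");   # smiley

    # 1 means the db regards the value as one character; 3 suggests it
    # is counting the three UTF-8 encoded bytes instead.
    my ($len, $c) = $dbh->selectrow_array("SELECT $func(c), c FROM t");
    is($len, 1,        'db sees one character');
    is($c, "\x{263A}", 'smiley survives the round trip');

    $dbh->do('DROP TABLE t');
    done_testing();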

>      Also probably best to avoid latin characters for this, I'd use
>      something that always has a multi-byte encoding, like a smiley face char.
Changed to use the smiley face for the data, but not for the table/column name checks, as most dbs have other rules for table/column names.

> 3. Focus on placeholders initially.
>      We can ponder utf8 in literal SQL later. That's a separate ball of mud.
>      (I'd also ignore unicode table/column/db names. It's a much lower
>      priority and may become clearer when other issues get resolved.)
The test already does that so I've left it in - it is easy to comment out - a one line change.
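To make the placeholder vs literal-SQL distinction concrete (a sketch, reusing the table t from Tim's example above):

    # Placeholder - the value travels via the driver's bind API, which
    # may carry encoding information with it:
    $dbh->do('INSERT INTO t VALUES (?)', undef, "\x{263A}");

    # Literal SQL - the value is embedded in the statement text, so the
    # encoding of the SQL string itself comes into play:
    $dbh->do('INSERT INTO t VALUES (' . $dbh->quote("\x{263A}") . ')');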
> 4. Tests could report local LANG / LC_ALL env var values
>      so when others report their results we'll have that context.
Added, although I stopped short of outputting everything in %ENV as I guessed people would end up having to edit the output to remove stuff - it reports LANG, LC_* and NLS_* right now.
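In other words, something like this (a sketch; the exact pattern in the script may differ):

    # Report the locale-related environment so results carry their context.
    printf "%s=%s\n", $_, $ENV{$_}
        for grep { /^(?:LANG$|LC_|NLS_)/ } sort keys %ENV;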

> Thanks again. I've only given it a quick skim. I'll read it again before LPW.
>
> Meanwhile, it would be great if people could contribute the info for #1.
>
> Tim.
>
> p.s. Using data_diff() http://search.cpan.org/~timb/DBI/DBI.pm#data_diff
> would make the tests shorter.
>      my $sample_string = "\x{263A}";
>      ...
>      print data_diff($sample_string, $returned_string);
Added data_diff() to tests which fail, i.e.:

    is($x, $y, "test") or diag(data_diff($x, $y));

(data_diff returns its report as a string, so it needs printing via diag() or similar.)

Test::More::is output is horrible for unicode on a non-UTF-8-enabled terminal.

BTW, the smiley character causes me problems with MS SQL Server. For some reason you can do:

    insert into table values(?)    <= bind_param("\{many_unicode_chars}")
    select column from table where column = $h->quote("\{many_unicode_chars}")

but the smiley does not work in the literal select SQL even though the insert works :-( A parameterized select works fine, too.
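To make the failing case concrete, this is roughly what I am comparing (a sketch; the table/column names are made up):

    my $smiley = "\x{263A}";

    # insert via placeholder - works
    $dbh->do('INSERT INTO unitest (c) VALUES (?)', undef, $smiley);

    # select via placeholder - also works
    my ($got) = $dbh->selectrow_array(
        'SELECT c FROM unitest WHERE c = ?', undef, $smiley);

    # select with the value quoted into the literal SQL - this is the
    # one that fails for the smiley even though it works for other chars
    my ($lit) = $dbh->selectrow_array(
        'SELECT c FROM unitest WHERE c = ' . $dbh->quote($smiley));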

I will upload the current script to DBI's subversion tree in the ex dir, but if anyone is going to do anything major with it I'd rather they mail dbi-dev or me first so I can at least check it before making any changes myself (at least until after LPW). Having said that, as the birthday boy tomorrow I can almost guarantee I won't be working on it tomorrow night.

Also, bear in mind I didn't start out writing the test code for inclusion in a public repository, and it is very time consuming to check it across multiple DBDs. The do_connect method contains connect calls that work for me here and will need changing if you want to run it against your data sources - it was just too much of a pain to do it via DBI_DSN/DBI_USER/DBI_PASS whilst I was gathering info (especially on Windows - my Linux home box is bust right now). A sketch of what needs editing is below.
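If you do pick it up, the part to edit looks something like this (a sketch only; the actual method in the script differs, and the DSNs/credentials here are made-up examples):

    sub do_connect {
        my ($driver) = @_;
        # Hard-coded connections that work for me here - edit these for
        # your own data sources.
        my %dsn = (
            mysql  => ['dbi:mysql:database=test', 'me', 'mypassword'],
            Oracle => ['dbi:Oracle:XE',           'me', 'mypassword'],
        );
        return DBI->connect(@{ $dsn{$driver} }, { RaiseError => 1 });
    }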

I hang around the #dbi IRC channel. If someone wants to make this test code work with Postgres I'd be happy to apply any changes - Postgres is one I'd really like to include, but it is proving too time consuming to add another DBD - especially one where I know active work is going on with unicode/encoding support, which would mean working with the latest trunk.

Lastly, if anyone knows where the pub meet-up for LPW is on Friday night could they tell me, as I've forgotten - lots of people owe me drinks so I need to get there in plenty of time :-)

Martin
