Re: DBD::ODBC tricky unicode problem

2013-10-27 Thread Martin J. Evans
I've snipped my original email and Tim's response and started again 
based on Tim's comments in the email and on IRC. So


o this concentrates on input to the database and ignores output.
o only looks so far at inserting into varchars (and not nvarchars which 
I currently believe works but we'll come back to that).
o includes examples of different inputs, lengths in perl and what you 
get back.
o test code and all discussion is windows only for now as ODBC drivers 
on Unix can work in very different ways. Also, test code run to MS SQL 
Server which in my experience is one of the best ODBC drivers for Windows.
o you cannot correctly translate chr data from a client to a server 
using the SQL Server ODBC driver if the client code page differs from 
the server code page - see http://support.microsoft.com/kb/234748.
o MS quite clearly state If your server must store data from multiple 
code pages, the supported solution is to store the data in Unicode 
columns (NCHAR/NVARCHAR/NTEXT). see 
http://support.microsoft.com/kb/234748 again.

o test code included at end of mail

A few things you should know:

1. D:O does not know what codepage you are in and there is nothing in 
ODBC which allows it to find that out or even say it matters.
2. I believe the TDS protocol (what MS SQL Server uses) allows a client 
code page to be sent, however I believe that is irrelevant now as the 
SQL Server driver converts your client code page chrs into unicode 
before sending them to SQL Server.
3. You cannot expect to get back chrs you inserted into char/varchar 
columns if you did not turn on AutoTranslate in the driver (which 
defaults to on).
4. D:O converts Perl unicode scalars to unicode on Windows using 
MultiByteToWideChar and setting CP_UTF8.


ok, so diving into the example, here is hopefully fairly self 
explanatory output from the test code (code at end of email). I've 
interleaved my comments starting with #.


Current active console code page: 1252
DBD::ODBC build for unicode:1
Output connstr: DSN=asus2;UID=sa;PWD=easysoft;APP=Strawberry Perl 
(64-bit);WSID=ASUS2;Network=DBMSSOCN;Address=ASUS2\SQLEXPRESS,1433


Inserting a unicode euro, utf8 flag on:
  input string: €
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 
bytes

  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 
bytes

  ords of output string:e2,201a,ac,
# what happened here is SQLDecribeParam described the parameter as 
SQL_CHAR and that is what D:O bound it as. The driver converted 82 in 
code page 1252 to 201a (a magic quote) and the database sees this as 3 chrs
# the probably mistake here is that D:O should have looked at the perl 
data, seen it was unicode and not bound it as SQL_CHAR (but SQL_WCHAR) 
even though SQLDescribeParam said it was


Inserting a UTF-8 encoded unicode euro, utf8 flag off:
\x{0082} does not map to cp1252 at test_code.pl line 36.
# the above resulted from a print to stdout in windows-1252
# you'd expect that since we were outputting bytes to the terminal
  input string: â\x{0082}¬
  data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3 
bytes

  ords of input string: e2,82,ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 
bytes

  ords of output string:e2,201a,ac,
# same thing as first case happened here except D:O couldn't have done 
anything else as Perl said the scalar was not unicode and SQL Server 
said the parameter was SQL_CHAR


Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
  input string: €
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 
bytes

  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 1
  data_string_desc of output string: UTF8 on, non-ASCII, 1 characters 3 
bytes

  ords of output string:20ac,
# the above is arguably what should have happened in the first test case 
i.e., D:O should have bound as SQL_WVARCHAR because the perl data was 
unicode. I included this one to show you what happens if you do it right.


Inserting a unicode U+187 which is not in the current code page:
\x{0187} does not map to cp1252 at test_code.pl line 36.
# you'd expect this warning here on printing to stdout
  input string: \x{0187}
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 2 
bytes

  ords of input string: 187,
  bytes of input string: c6,87,
  database length: 2
  data_string_desc of output string: UTF8 on, non-ASCII, 2 characters 5 
bytes

  ords of output string:c6,2021,
# here again, down to D:O binding as SQL_CHAR, SQL Server interpreted 
this as 2 chrs in the client code page, and converted them to chr c6 and 
a double dagger


Inserting a unicode U+187 which is not in the current code page with 
forced binding:

\x{0187} does not map to cp1252 at test_code.pl line 36.
  input string: \x{0187}
  

Re: DBD::ODBC tricky unicode problem

2013-10-27 Thread Tim Bunce
On Sun, Oct 27, 2013 at 12:18:53PM +, Martin J. Evans wrote:
 
 Inserting a unicode euro, utf8 flag on:
   input string: €
   data_string_desc of input string: UTF8 on, non-ASCII, 1 characters
 3 bytes
   ords of input string: 20ac,
   bytes of input string: e2,82,ac,
   database length: 3
   data_string_desc of output string: UTF8 on, non-ASCII, 3
 characters 7 bytes
   ords of output string:e2,201a,ac,
 # what happened here is SQLDecribeParam described the parameter as
 SQL_CHAR and that is what D:O bound it as. The driver converted 82
 in code page 1252 to 201a (a magic quote) and the database sees this
 as 3 chrs
 # the probably mistake here is that D:O should have looked at the
 perl data, seen it was unicode and not bound it as SQL_CHAR (but
 SQL_WCHAR) even though SQLDescribeParam said it was

Agreed.

 Inserting a UTF-8 encoded unicode euro, utf8 flag off:
 \x{0082} does not map to cp1252 at test_code.pl line 36.

I'd caution against using phases like UTF-8 encoded unicode euro, utf8
flag off. From the application's perspective it's not a euro, it's just
a sequence of bytes (that just happens to match what a euro unicode
codepoint would look like when UTF-8 encoded).

To put it another way, if the application has that string of bytes and
thinks it's a euro then the application is almost certainly broken.

 Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
 ...
 # the above is arguably what should have happened in the first test
 case i.e., D:O should have bound as SQL_WVARCHAR because the perl
 data was unicode. I included this one to show you what happens if
 you do it right.

Agreed.

 So, I'm thinking the problem above is D:O ignores utf8 flag on
 parameters when they are bound and uses whatever SQLDescribeParam
 says instead (SQL_CHAR in case of varchar columns). If it didn't
 ignore the utf8 flag here, it would have to rebind on every execute
 (which it may already do, I didn't check).

Agreed.

Great. Progress! :)

So, the next question is what are the implications of fixing it for
existing applications? Do you need a deprecation cycle with warnings etc?

Tim.


Re: DBD::ODBC tricky unicode problem

2013-10-27 Thread Martin J. Evans

On 27/10/2013 14:49, Tim Bunce wrote:

Thanks to Tim for being (as usual) a good sounding board.


On Sun, Oct 27, 2013 at 12:18:53PM +, Martin J. Evans wrote:


Inserting a unicode euro, utf8 flag on:
   input string: €
   data_string_desc of input string: UTF8 on, non-ASCII, 1 characters
3 bytes
   ords of input string: 20ac,
   bytes of input string: e2,82,ac,
   database length: 3
   data_string_desc of output string: UTF8 on, non-ASCII, 3
characters 7 bytes
   ords of output string:e2,201a,ac,
# what happened here is SQLDecribeParam described the parameter as
SQL_CHAR and that is what D:O bound it as. The driver converted 82
in code page 1252 to 201a (a magic quote) and the database sees this
as 3 chrs
# the probably mistake here is that D:O should have looked at the
perl data, seen it was unicode and not bound it as SQL_CHAR (but
SQL_WCHAR) even though SQLDescribeParam said it was


Agreed.


Good. That is also easily done assuming I check params are rebound in 
case someone tries to insert code page chrs followed by unicode.



Inserting a UTF-8 encoded unicode euro, utf8 flag off:
\x{0082} does not map to cp1252 at test_code.pl line 36.


I'd caution against using phases like UTF-8 encoded unicode euro, utf8
flag off. From the application's perspective it's not a euro, it's just
a sequence of bytes (that just happens to match what a euro unicode
codepoint would look like when UTF-8 encoded).


Yeah, I realised that, I only threw that one in because I've come across 
people doing that i.e., encoding unicode and inserting the encoded 
octets, and I wanted to cover all bases.



To put it another way, if the application has that string of bytes and
thinks it's a euro then the application is almost certainly broken.


see above.


Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
...
# the above is arguably what should have happened in the first test
case i.e., D:O should have bound as SQL_WVARCHAR because the perl
data was unicode. I included this one to show you what happens if
you do it right.


Agreed.


So, I'm thinking the problem above is D:O ignores utf8 flag on
parameters when they are bound and uses whatever SQLDescribeParam
says instead (SQL_CHAR in case of varchar columns). If it didn't
ignore the utf8 flag here, it would have to rebind on every execute
(which it may already do, I didn't check).


Agreed.

Great. Progress! :)

So, the next question is what are the implications of fixing it for
existing applications? Do you need a deprecation cycle with warnings etc?

Tim.



Now we've got to this point we /can/ consider what happens when you read 
the data back. By default, in a unicode build of DBD::ODBC ALL string 
data is bound as SQL_WCHARs and I'm of the opinion that this is right 
and should continue (apart from anything else the only alternative is to 
bind as SQL_CHAR when the column is char/varchar and this doesn't help 
at all as the server will have to map server codepage chrs to client 
codepage chrs and this cannot always work). So let's consider what 
happens for each example:


Inserting a unicode euro, utf8 flag on:
  input string: €
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 
bytes

  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 
bytes

  ords of output string:e2,201a,ac,

So, here you didn't get back what you put in and the database didn't see 
the inserted data as a single euro chr but as 3 individual chrs in the 
client codepage which was mapped to the column codepage. When read back 
it mapped from the column code page to unicode as the select bound chrs 
as SQL_WCHAR. I'm struggling to see the value to anyone of this 
behaviour but I'd love to hear from you if you don't agree.


Inserting a UTF-8 encoded unicode euro, utf8 flag off:
  In other words a stream of octets
\x{0082} does not map to cp1252 at test_code.pl line 36.
  input string: â\x{0082}¬
  data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3 
bytes

  ords of input string: e2,82,ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 
bytes

  ords of output string:e2,201a,ac,

This is just the same as the first example and as such I cannot see the 
usefulness of it. However, because I know this I am at a loss as to why 
no one has told me this before I discovered it for myself. I've spoken 
to a few people actively using DBD::ODBC on Windows and mostly they are 
not using unicode for inserts/updates. If they restrict their inserts to 
ASCII there will be no change for them as what they get back now is 
already unicode (just the first 127 chrs of unicode) because DBD::ODBC 
always binds as SQL_WCHAR. If you insert chrs in the upper part of your 
codepage then what you get back is not what you inserted i.e., it is not 
chrs in your codepage, it is unicode and the