Re: DBD::ODBC tricky unicode problem
I've snipped my original email and Tim's response and started again, based on Tim's comments in the email and on IRC. So this:

o concentrates on input to the database and ignores output.
o only looks so far at inserting into varchars (and not nvarchars, which I currently believe work, but we'll come back to that).
o includes examples of different inputs, their lengths in Perl, and what you get back.
o test code and all discussion is Windows-only for now, as ODBC drivers on Unix can work in very different ways. Also, the test code was run against MS SQL Server, whose ODBC driver is in my experience one of the best on Windows.
o you cannot correctly translate chr data from a client to a server using the SQL Server ODBC driver if the client code page differs from the server code page - see http://support.microsoft.com/kb/234748.
o MS quite clearly state "If your server must store data from multiple code pages, the supported solution is to store the data in Unicode columns (NCHAR/NVARCHAR/NTEXT)" - see http://support.microsoft.com/kb/234748 again.
o test code is included at the end of this mail.

A few things you should know:

1. D:O does not know what code page you are in, and there is nothing in ODBC which allows it to find that out or even says it matters.
2. I believe the TDS protocol (what MS SQL Server uses) allows a client code page to be sent; however, I believe that is irrelevant now, as the SQL Server driver converts your client code page chrs into unicode before sending them to SQL Server.
3. You cannot expect to get back the chrs you inserted into char/varchar columns if you did not turn on AutoTranslate in the driver (it defaults to on).
4. D:O converts Perl unicode scalars to unicode on Windows using MultiByteToWideChar with CP_UTF8.

OK, so diving into the example, here is hopefully fairly self-explanatory output from the test code (code at end of email). I've interleaved my comments starting with #.
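An aside on point 4, before diving into the output: the MultiByteToWideChar(CP_UTF8, ...) call is just a UTF-8 to UTF-16 widening. A minimal sketch of that step (Python here purely for illustration; the real conversion in D:O is the Win32 call, not this code):

```python
# A Perl unicode scalar is held internally as UTF-8 bytes; the ODBC "W"
# functions want UTF-16 code units, so D:O widens the bytes, e.g. for
# a euro (U+20AC):
utf8_bytes = "\u20ac".encode("utf-8")                  # e2 82 ac
wide = utf8_bytes.decode("utf-8").encode("utf-16-le")  # widen to UTF-16
print(wide.hex())                                      # ac20 - one little-endian UTF-16 unit
```

Nothing is lost in this step; the problems described below all happen later, at binding time.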
Current active console code page: 1252
DBD::ODBC build for unicode: 1
Output connstr: DSN=asus2;UID=sa;PWD=easysoft;APP=Strawberry Perl (64-bit);WSID=ASUS2;Network=DBMSSOCN;Address=ASUS2\SQLEXPRESS,1433

Inserting a unicode euro, utf8 flag on:
  input string: €
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
  ords of output string: e2,201a,ac,

# what happened here is SQLDescribeParam described the parameter as SQL_CHAR and that is what D:O bound it as. The driver converted 82 in code page 1252 to 201a (a magic quote) and the database sees this as 3 chrs
# the probable mistake here is that D:O should have looked at the perl data, seen it was unicode and bound it as SQL_WCHAR (not SQL_CHAR) even though SQLDescribeParam said it was SQL_CHAR

Inserting a UTF-8 encoded unicode euro, utf8 flag off:
\x{0082} does not map to cp1252 at test_code.pl line 36.
# the above warning resulted from a print to stdout in windows-1252
# you'd expect that, since we were outputting bytes to the terminal
  input string: â\x{0082}¬
  data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3 bytes
  ords of input string: e2,82,ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
  ords of output string: e2,201a,ac,
# the same thing happened here as in the first case, except D:O couldn't have done anything else, as Perl said the scalar was not unicode and SQL Server said the parameter was SQL_CHAR

Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
  input string: €
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 1
  data_string_desc of output string: UTF8 on, non-ASCII, 1 characters 3 bytes
  ords of output string: 20ac,
# the above is arguably what should have happened in the first test case, i.e., D:O should have bound as SQL_WVARCHAR because the perl data was unicode. I included this one to show you what happens if you do it right.

Inserting a unicode U+187 which is not in the current code page:
\x{0187} does not map to cp1252 at test_code.pl line 36.
# you'd expect this warning here on printing to stdout
  input string: \x{0187}
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 2 bytes
  ords of input string: 187,
  bytes of input string: c6,87,
  database length: 2
  data_string_desc of output string: UTF8 on, non-ASCII, 2 characters 5 bytes
  ords of output string: c6,2021,
# here again, down to D:O binding as SQL_CHAR, SQL Server interpreted this as 2 chrs in the client code page and converted them to chr c6 and a double dagger

Inserting a unicode U+187 which is not in the current code page with forced binding:
\x{0187} does not map to cp1252 at test_code.pl line 36.
  input string: \x{0187}
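The two corruptions commented on above are pure code page arithmetic and can be reproduced without ODBC at all. A minimal sketch (Python used only for illustration; the byte mechanics are language-independent):

```python
# Case 1: the UTF-8 bytes of a euro (e2 82 ac), bound as SQL_CHAR, are read
# by the driver as three windows-1252 characters; 0x82 in cp1252 is U+201A
# (the "magic quote" in the test output above).
euro = "\u20ac"
garbled = euro.encode("utf-8").decode("cp1252")
print([hex(ord(c)) for c in garbled])   # ['0xe2', '0x201a', '0xac']

# Case 2: U+0187 travels as its UTF-8 bytes c6 87; read as cp1252 those
# become U+00C6 and U+2021 (the double dagger), matching the output above.
hooked_c = "\u0187"
garbled2 = hooked_c.encode("utf-8").decode("cp1252")
print([hex(ord(c)) for c in garbled2])  # ['0xc6', '0x2021']
```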
Re: DBD::ODBC tricky unicode problem
On Sun, Oct 27, 2013 at 12:18:53PM +, Martin J. Evans wrote:

> Inserting a unicode euro, utf8 flag on:
>   input string: €
>   data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
>   ords of input string: 20ac,
>   bytes of input string: e2,82,ac,
>   database length: 3
>   data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
>   ords of output string: e2,201a,ac,
> # what happened here is SQLDescribeParam described the parameter as SQL_CHAR and that is what D:O bound it as. The driver converted 82 in code page 1252 to 201a (a magic quote) and the database sees this as 3 chrs
> # the probable mistake here is that D:O should have looked at the perl data, seen it was unicode and bound it as SQL_WCHAR (not SQL_CHAR) even though SQLDescribeParam said it was SQL_CHAR

Agreed.

> Inserting a UTF-8 encoded unicode euro, utf8 flag off:
> \x{0082} does not map to cp1252 at test_code.pl line 36.

I'd caution against using phrases like "UTF-8 encoded unicode euro, utf8 flag off". From the application's perspective it's not a euro, it's just a sequence of bytes (that just happens to match what a euro unicode codepoint would look like when UTF-8 encoded). To put it another way, if the application has that string of bytes and thinks it's a euro then the application is almost certainly broken.

> Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
> ...
> # the above is arguably what should have happened in the first test case, i.e., D:O should have bound as SQL_WVARCHAR because the perl data was unicode. I included this one to show you what happens if you do it right.

Agreed.

> So, I'm thinking the problem above is D:O ignores the utf8 flag on parameters when they are bound and uses whatever SQLDescribeParam says instead (SQL_CHAR in the case of varchar columns). If it didn't ignore the utf8 flag here, it would have to rebind on every execute (which it may already do, I didn't check).

Agreed. Great. Progress!
:)

So, the next question is: what are the implications of fixing it for existing applications? Do you need a deprecation cycle with warnings etc?

Tim.
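To make the problem just agreed on concrete: the first example's failed round trip (euro in, something else out) can be modelled end to end, assuming client and column code pages are both windows-1252 as in the test output (Python here, purely illustrative):

```python
inserted = "\u20ac"   # one euro character, utf8 flag on

# INSERT: D:O binds as SQL_CHAR (per SQLDescribeParam), so the driver treats
# the UTF-8 bytes e2 82 ac as three cp1252 characters and stores all three.
stored = inserted.encode("utf-8").decode("cp1252")

# SELECT: D:O binds as SQL_WCHAR, so the three stored characters come back
# as unicode codepoints e2, 201a, ac - not the euro that went in.
fetched = stored  # cp1252 chars map straight to unicode codepoints here

print(fetched == inserted)              # False
print([hex(ord(c)) for c in fetched])   # ['0xe2', '0x201a', '0xac']
```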
Re: DBD::ODBC tricky unicode problem
On 27/10/2013 14:49, Tim Bunce wrote:

Thanks to Tim for being (as usual) a good sounding board.

> On Sun, Oct 27, 2013 at 12:18:53PM +, Martin J. Evans wrote:
>> Inserting a unicode euro, utf8 flag on:
>>   input string: €
>>   data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
>>   ords of input string: 20ac,
>>   bytes of input string: e2,82,ac,
>>   database length: 3
>>   data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
>>   ords of output string: e2,201a,ac,
>> # what happened here is SQLDescribeParam described the parameter as SQL_CHAR and that is what D:O bound it as. The driver converted 82 in code page 1252 to 201a (a magic quote) and the database sees this as 3 chrs
>> # the probable mistake here is that D:O should have looked at the perl data, seen it was unicode and bound it as SQL_WCHAR (not SQL_CHAR) even though SQLDescribeParam said it was SQL_CHAR
>
> Agreed.

Good. That is also easily done, assuming I check params are rebound in case someone tries to insert code page chrs followed by unicode.

>> Inserting a UTF-8 encoded unicode euro, utf8 flag off:
>> \x{0082} does not map to cp1252 at test_code.pl line 36.
>
> I'd caution against using phrases like "UTF-8 encoded unicode euro, utf8 flag off". From the application's perspective it's not a euro, it's just a sequence of bytes (that just happens to match what a euro unicode codepoint would look like when UTF-8 encoded).

Yeah, I realised that. I only threw that one in because I've come across people doing that, i.e., encoding unicode and inserting the encoded octets, and I wanted to cover all bases.

> To put it another way, if the application has that string of bytes and thinks it's a euro then the application is almost certainly broken.

See above.

>> Inserting a unicode euro, utf8 flag on, forced SQL_WVARCHAR:
>> ...
>> # the above is arguably what should have happened in the first test case, i.e., D:O should have bound as SQL_WVARCHAR because the perl data was unicode. I included this one to show you what happens if you do it right.
> Agreed.
>
>> So, I'm thinking the problem above is D:O ignores the utf8 flag on parameters when they are bound and uses whatever SQLDescribeParam says instead (SQL_CHAR in the case of varchar columns). If it didn't ignore the utf8 flag here, it would have to rebind on every execute (which it may already do, I didn't check).
>
> Agreed. Great. Progress! :)
>
> So, the next question is what are the implications of fixing it for existing applications? Do you need a deprecation cycle with warnings etc?
>
> Tim.

Now we've got to this point we /can/ consider what happens when you read the data back. By default, in a unicode build of DBD::ODBC ALL string data is bound as SQL_WCHAR, and I'm of the opinion that this is right and should continue (apart from anything else, the only alternative is to bind as SQL_CHAR when the column is char/varchar, and that doesn't help at all, as the server would have to map server code page chrs to client code page chrs, which cannot always work).

So let's consider what happens for each example:

Inserting a unicode euro, utf8 flag on:
  input string: €
  data_string_desc of input string: UTF8 on, non-ASCII, 1 characters 3 bytes
  ords of input string: 20ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
  ords of output string: e2,201a,ac,

So, here you didn't get back what you put in, and the database didn't see the inserted data as a single euro chr but as 3 individual chrs in the client code page, which were mapped to the column code page. When read back, they were mapped from the column code page to unicode, as the select bound chrs as SQL_WCHAR. I'm struggling to see the value to anyone of this behaviour, but I'd love to hear from you if you don't agree.

Inserting a UTF-8 encoded unicode euro, utf8 flag off (in other words, a stream of octets):
\x{0082} does not map to cp1252 at test_code.pl line 36.
  input string: â\x{0082}¬
  data_string_desc of input string: UTF8 off, non-ASCII, 3 characters 3 bytes
  ords of input string: e2,82,ac,
  bytes of input string: e2,82,ac,
  database length: 3
  data_string_desc of output string: UTF8 on, non-ASCII, 3 characters 7 bytes
  ords of output string: e2,201a,ac,

This is just the same as the first example, and as such I cannot see the usefulness of it either. However, now that I know this, I am at a loss as to why no one told me about it before I discovered it for myself. I've spoken to a few people actively using DBD::ODBC on Windows and mostly they are not using unicode for inserts/updates. If they restrict their inserts to ASCII, there will be no change for them, as what they get back now is already unicode (just the first 127 chrs of unicode) because DBD::ODBC always binds as SQL_WCHAR. If you insert chrs in the upper part of your code page, then what you get back is not what you inserted, i.e., it is not chrs in your code page, it is unicode and the