On 09/05/11 22:06, Tim Bunce wrote:
> On Mon, May 09, 2011 at 07:42:53PM +0100, Martin J. Evans wrote:
>> I've recently had an RT posted
>> (http://rt.cpan.org/Public/Bug/Display.html?id=67994) after a
>> discussion on stackoverflow
>> (http://stackoverflow.com/questions/5912082/automatic-character-encoding-handling-in-perl-dbi-dbdodbc).
>>
>> In this case the Perl script is binding the columns but the data
>> returned is windows-1252 and the user is having to manually
>> Encode::decode all bound columns. DBD::ODBC already had an
>> odbc_utf8_on flag
>> (http://search.cpan.org/~mjevans/DBD-ODBC-1.29/ODBC.pm#odbc_utf8_on)
>> for a derivative of Postgres which returns bound data UTF-8 encoded,
>> but in that case I can just call sv_utf8_decode (in the XS) and it
>> is converted in place. Initially I thought I could combine
>> odbc_utf8_on into a new flag saying "my data is returned as xxx" and
>> just call Encode::decode with xxx (then eventually I could drop
>> odbc_utf8_on).

> I'd be wary of going down this path.  I sense pain just beyond the horizon.
> A twisty-turny maze of sharp edge cases and unforeseen issues.

So do I, which is why I am still thinking about it and why I value any other
opinions and ideas.
> For a start: What about the charset of bind values?  What about the
> charset of SQL literals?

Both of those are inputs, but this case is data retrieved from MS SQL Server, which I see as different. You can ensure your Perl data is inserted into a Unicode-aware database via ODBC by ensuring your data is encoded as Unicode - DBD::ODBC will spot that and bind the data as SQL_WCHAR, or pass the SQL to SQLPrepareW/SQLExecDirectW (after converting it to UCS-2). The opposite is not true when retrieving data unless you specifically bind all char types as SQL_WCHAR, and that is not the default in DBD::ODBC for varchar columns (only nvarchar columns).

Perhaps you're thinking that this is where the flaw really is, but it is more difficult than that. DBD::ODBC asks the database what the column type is: if it says SQL_CHAR, the column is bound as SQL_CHAR (retrieving bytes in whatever encoding the database and client libs decide), and if it says SQL_WCHAR, it is bound as SQL_WCHAR (and returned as UCS-2, which DBD::ODBC happily converts for you). In this guy's case the column is identified as SQL_CHAR, it is bound as SQL_CHAR, and the returned data is windows-1252 - DBD::ODBC knows nothing about that. I appreciate his problem to a degree, as it is a pain to have to decode every bound column.
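To make the manual workaround concrete, here is a minimal sketch of what he has to do for each bound column (the byte string is made up for illustration; in a real script it would be whatever the SQL_CHAR bind hands back):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Simulate what a column bound as SQL_CHAR hands back: raw
# windows-1252 bytes ("café", with cp1252's 0xE9 for "é").
my $bound_value = "caf\xe9";

# What the user currently has to do, per bound column:
my $decoded = decode('cp1252', $bound_value);

printf "decoded %d cp1252 bytes into %d characters\n",
    length($bound_value), length($decoded);
```

The point being that nothing in the driver does this step for him today - every fetch needs a decode call per character column.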

Of course, I could change DBD::ODBC when built in unicode mode to bind all
SQL_CHAR columns as SQL_WCHAR, and that would also solve his problem, but a) it
would be more expensive for those who don't need it, and b) I'd worry it
might break existing code, e.g. anyone like him who is already decoding the
data themselves.

> Can't the database connection/session settings be altered to assume utf8
> at the client end and have the server or client libs automatically
> convert for you? If so, that's a good way to go.

> Tim.

Perhaps, although at this stage I don't know how, and in any case even if the
data could be returned UTF-8 encoded I would need to know it is UTF-8 encoded
to do anything with it. The additional problem is that UTF-8 encoded data may
be longer than the size of the column as reported to DBD::ODBC, so it would
need to know this in advance and multiply all buffers by 4 (RFC 3629 caps
UTF-8 at 4 bytes per code point; the original design went to 6).

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com
