Re: DBD::ODBC and character sets

Alexander Foken Wed, 30 Sep 2009 12:31:10 -0700

On 30.09.2009 15:39, Martin Evans wrote:

Does your setup pass the DBD::ODBC tests?

No, it does not:

t/40UnicodeRoundTrip.t

At least this test should pass without warnings and errors. If it doesn't, the following Unicode tests do not make sense at all.

You are entering a world of pain.


Right. Unicode is too young in computer terms ... ;-)
And the various encodings and Unicode versions don't make things easier.

use encoding xxx

This is used in Perl to say your script is encoded in xxx. Just because
you have and accept UTF-8 encoded data does not mean you need to "use
encoding" but if your script is encoded in xxx you need "use encoding
xxx". For instance, the example Hendrik gave you includes unicode
characters but does not need encoding. As a result, I cannot see how
adding "use encoding 'utf-8'" should make any difference to data
returned from sql server through DBD::ODBC.

It can make a difference if you add "use encoding 'utf-8';" to a script that is really encoded as iso-8859-1, or if you don't add it to a script encoded as UTF-8 *and* the script contains non-ASCII string literals. In both cases, you end up with strings where encoding and UTF-8 flag do not match.


Example 1:

#!/usr/bin/perl -w
use strict;
use encoding "utf-8"; # but file is encoded as iso-8859-1
("ÄÖÜ" eq "\x{00C4}\x{00D6}\x{00DC}") or die "encoding mismatch";
# ^-- literal german umlauts, upper case, encoded as iso-8859-1
print "ok\n";

Output:

Malformed UTF-8 character (unexpected non-continuation byte 0xd6, immediately after start byte 0xc4) at test.pl line 4.
Malformed UTF-8 character (unexpected non-continuation byte 0xdc, immediately after start byte 0xd6) at test.pl line 4.
Malformed UTF-8 character (1 byte, need 2, after start byte 0xdc) at test.pl line 4.

encoding mismatch at test.pl line 4.

Example 2:

#!/usr/bin/perl -w
use strict;
# no "use encoding "utf-8";", but file is encoded as UTF-8
("ÄÖÜ" eq "\x{00C4}\x{00D6}\x{00DC}") or die "encoding mismatch";
# ^-- literal german umlauts, upper case, encoded as UTF-8
print "ok\n";

Output:

encoding mismatch at test.pl line 4.

Note that Example 2 does not give you any warnings, as ISO-8859-1 does not have any invalid byte sequences. Perl sees the left-hand side of eq as a string literal containing six(!) characters encoded as ISO-8859-1 (those 6 bytes that encode ÄÖÜ in UTF-8); that literal has its UTF-8 flag turned off. The right-hand side is a string literal containing three characters (U+00C4, U+00D6, U+00DC). A string of six characters cannot be the same as a string of three characters, so the eq expression is false.

In Example 1, Perl sees three(!) bytes(!) in the string literal on the left-hand side of eq that do not represent a valid UTF-8 string, hence the three warnings. Still, the string has a length of three characters and has its UTF-8 flag set. The right-hand side is the same as in Example 2, but the binary junk is not equal to "ÄÖÜ", so again, the eq expression is false.
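
The six-versus-three mismatch from Example 2 can be reproduced without depending on how the script file itself is encoded. The following is only a rough sketch: the variable names are mine, the UTF-8 bytes are written as \x escapes, and Encode::decode() stands in for whatever layer would normally decode the data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# The six bytes that encode "ÄÖÜ" in UTF-8, written as escapes so that
# the encoding of this source file does not matter.
my $bytes = "\xC3\x84\xC3\x96\xC3\x9C";

# The three characters U+00C4, U+00D6, U+00DC.
my $chars = "\x{00C4}\x{00D6}\x{00DC}";

print length($bytes), "\n";                            # 6
print length($chars), "\n";                            # 3
print $bytes eq $chars ? "equal\n" : "not equal\n";    # not equal

# Decoding the bytes as UTF-8 yields the three characters again.
my $decoded = decode('UTF-8', $bytes);
print length($decoded), "\n";                          # 3
print $decoded eq $chars ? "equal\n" : "not equal\n";  # equal
```

Undecoded bytes compare as six characters. Once they are decoded, eq sees the same three characters on both sides.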

t/40UnicodeRoundTrip.t is intentionally written using \x{0000} sequences instead of non-ASCII literals to prevent this special problem. And it has four paranoia tests (utf8::is_utf8(...) in the BEGIN block) to absolutely make sure the test data has the UTF-8 flag set or cleared as expected.
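
To illustrate the idea (this is a simplified sketch, not the actual contents of t/40UnicodeRoundTrip.t, and the test strings are made up), such a paranoia check could look roughly like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @data;
BEGIN {
    # Illustrative test data, written with \x{...} escapes only, so the
    # encoding of this file cannot influence the strings.
    @data = (
        "plain ASCII text",
        "umlauts \x{00C4}\x{00D6}\x{00DC} and a smiley \x{263A}",
    );
    # Paranoia: die at compile time if the flags are not what we expect.
    utf8::is_utf8($data[0])
        and die "UTF-8 flag unexpectedly set on ASCII test data";
    utf8::is_utf8($data[1])
        or die "UTF-8 flag unexpectedly clear on Unicode test data";
}

print "test data flags are as expected\n";
```

The second string contains U+263A, a code point above 0xFF, so Perl has to store it with the UTF-8 flag set. The pure ASCII string stays unflagged.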

t/UChelp.pm has a dumpstr() function that dumps the unicode string in pure ASCII using \x00 or \x{0000} sequences, including length and UTF-8 flag. It prevents the unwanted side effect of a UTF-8-capable terminal that displays bytes written by Perl as Unicode characters, even if they were meant to be non-unicode.
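
A helper in that spirit might look like the sketch below. Note that dump_string() is a hypothetical name and this is not the dumpstr() code from t/UChelp.pm; the output format is my own assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, NOT the dumpstr() from t/UChelp.pm: renders a string
# in pure ASCII together with its length and UTF-8 flag.
sub dump_string {
    my ($s) = @_;
    return 'undef' unless defined $s;
    my $flag = utf8::is_utf8($s) ? 'on' : 'off';
    my $body = join '', map {
        my $cp = ord;
        $cp > 0xFF ? sprintf('\\x{%04X}', $cp)      # wide character
      : ($cp < 0x20 || $cp > 0x7E || $_ eq '\\' || $_ eq '"')
                   ? sprintf('\\x%02X', $cp)        # non-printable, quote or backslash
      :              $_;                            # printable ASCII
    } split //, $s;
    return sprintf('len=%d utf8=%s "%s"', length($s), $flag, $body);
}

print dump_string("abc"), "\n";             # len=3 utf8=off "abc"
print dump_string("\xC4\xD6"), "\n";        # len=2 utf8=off "\xC4\xD6"
print dump_string("smile \x{263A}"), "\n";  # len=7 utf8=on "smile \x{263A}"
```

Because the output is plain ASCII, it looks the same on any terminal, whatever encoding the terminal assumes.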



Alexander


--
Alexander Foken
mailto:alexan...@foken.de  http://www.foken.de/alexander/
