On 30.09.2009 15:39, Martin Evans wrote:
Does your setup pass the DBD::ODBC tests?
No, it does not:

t/40UnicodeRoundTrip.t
At least this test should pass without warnings and errors. If it doesn't, the following Unicode tests do not make sense at all.


You are entering a world of pain.

Right. Unicode is too young in computer terms ... ;-)
And the various encodings and Unicode versions don't make things easier.

use encoding xxx

This is used in Perl to say your script is encoded in xxx. Just because
you have and accept UTF-8 encoded data does not mean you need to "use
encoding" but if your script is encoded in xxx you need "use encoding
xxx". For instance, the example Hendrik gave you includes Unicode
characters but does not need encoding. As a result, I cannot see how
adding "use encoding 'utf-8'" should make any difference to data
returned from SQL Server through DBD::ODBC.
It can make a difference if you add "use encoding 'utf-8';" to a script that is really encoded as ISO-8859-1, or if you don't add it to a script encoded as UTF-8 *and* the script contains non-ASCII string literals. In both cases, you end up with strings whose encoding and UTF-8 flag do not match.

Example 1:

#!/usr/bin/perl -w
use strict;
use encoding "utf-8"; # but file is encoded as iso-8859-1
("ÄÖÜ" eq "\x{00C4}\x{00D6}\x{00DC}") or die "encoding mismatch";
# ^-- literal German umlauts, upper case, encoded as ISO-8859-1
print "ok\n";

Output:

Malformed UTF-8 character (unexpected non-continuation byte 0xd6, immediately after start byte 0xc4) at test.pl line 4.
Malformed UTF-8 character (unexpected non-continuation byte 0xdc, immediately after start byte 0xd6) at test.pl line 4.
Malformed UTF-8 character (1 byte, need 2, after start byte 0xdc) at test.pl line 4.
encoding mismatch at test.pl line 4.

Example 2:

#!/usr/bin/perl -w
use strict;
# no "use encoding "utf-8";", but file is encoded as UTF-8
("ÄÖÜ" eq "\x{00C4}\x{00D6}\x{00DC}") or die "encoding mismatch";
# ^-- literal German umlauts, upper case, encoded as UTF-8
print "ok\n";

Output:

encoding mismatch at test.pl line 4.


Note that Example 2 does not give you any warnings, as ISO-8859-1 does not have any invalid byte sequences. Perl sees the left-hand side of eq as a string literal containing six(!) characters encoded as ISO-8859-1 (those 6 bytes that encode ÄÖÜ in UTF-8); that literal has its UTF-8 flag turned off. The right-hand side is a string literal containing three UTF-8 characters, internally stored as the same six bytes, but with the UTF-8 flag turned on. A string of six characters cannot be the same as a string of three characters, so the eq expression is false.

In Example 1, Perl sees three(!) bytes(!) in the string literal on the left-hand side of eq that do not represent a valid UTF-8 string, hence the three warnings. Still, the string has a length of three characters and has its UTF-8 flag set. The right-hand side is the same as in Example 2, but the binary junk is not equal to "ÄÖÜ", so again, the eq expression is false.
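
The length mismatch from Example 2 can be reproduced without depending on the encoding of the script file. The following is only a sketch: it writes the six UTF-8 bytes as ASCII escapes (the variable names are made up for illustration) and uses the core functions utf8::decode() and utf8::is_utf8(). It also shows that the three ISO-8859-1 bytes from Example 1 are rejected as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The six bytes that encode the three umlauts in UTF-8, written as ASCII
# escapes so the encoding of this file does not matter.
my $bytes = "\xC3\x84\xC3\x96\xC3\x9C";

# The same three umlauts as code points, as in the examples above.
my $chars = "\x{00C4}\x{00D6}\x{00DC}";

printf "bytes: length %d\n", length $bytes;
printf "chars: length %d\n", length $chars;
print  "equal before decode: ", ($bytes eq $chars ? "yes" : "no"), "\n";

# utf8::decode() turns the UTF-8 octets into characters, in place.
my $decoded = $bytes;
utf8::decode($decoded) or die "not valid UTF-8";
print "equal after decode:  ", ($decoded eq $chars ? "yes" : "no"), "\n";
print "flag after decode:   ", (utf8::is_utf8($decoded) ? "on" : "off"), "\n";

# The three ISO-8859-1 bytes from Example 1 are not valid UTF-8,
# so utf8::decode() returns false for them.
my $latin1  = "\xC4\xD6\xDC";
my $latin1_ok = utf8::decode($latin1) ? 1 : 0;
print "latin1 bytes decode as UTF-8: ", ($latin1_ok ? "yes" : "no"), "\n";
```

Before decoding, the comparison fails because six characters are compared with three; after decoding, both sides hold the same three characters.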

t/40UnicodeRoundTrip.t is intentionally written using \x{0000} sequences instead of non-ASCII literals to prevent this special problem. And it has four paranoia tests (utf8::is_utf8(...) in the BEGIN block) to absolutely make sure the test data has the UTF-8 flag set or cleared as expected.
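
A paranoia check of that kind could look roughly like the sketch below. This is not the code from t/40UnicodeRoundTrip.t; the two test strings are hypothetical and chosen so that the expected flag state is unambiguous (pure ASCII versus code points above 0xFF).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical test data, NOT the actual contents of t/40UnicodeRoundTrip.t.
my $ascii = "hello";              # pure ASCII literal, flag expected to be off
my $wide  = "\x{20AC}\x{263A}";   # code points above 0xFF, flag expected to be on

# Refuse to continue if the test data is not in the expected state.
die "unexpected UTF-8 flag on ASCII data" if     utf8::is_utf8($ascii);
die "missing UTF-8 flag on wide data"     unless utf8::is_utf8($wide);

print "test data looks as expected\n";
```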

t/UChelp.pm has a dumpstr() function that dumps the Unicode string in pure ASCII using \x00 or \x{0000} sequences, including length and UTF-8 flag. It prevents the unwanted side effect of a UTF-8-capable terminal that displays bytes written by Perl as Unicode characters, even if they were meant to be non-Unicode.
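
To illustrate the idea, here is a minimal sketch of such a dump helper. It is not the dumpstr() from t/UChelp.pm; the name dump_ascii() and the output format are made up for this example, and a literal backslash or double quote in the input is not escaped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper in the spirit of dumpstr(); NOT the code from t/UChelp.pm.
# Returns a pure-ASCII description of a string: length, UTF-8 flag, contents.
sub dump_ascii {
    my ($s) = @_;
    my $escaped = join '', map {
        my $cp = ord;
        $cp > 0xFF                   ? sprintf('\\x{%04X}', $cp)  # wide character
      : ($cp < 0x20 || $cp > 0x7E)   ? sprintf('\\x%02X', $cp)    # non-printable or 8-bit
      :                                $_;                        # printable ASCII
    } split //, $s;
    return sprintf 'length=%d utf8=%s "%s"',
        length($s), (utf8::is_utf8($s) ? 'on' : 'off'), $escaped;
}

print dump_ascii("A\x{20AC}"), "\n";   # one ASCII character plus the euro sign
print dump_ascii("\xC3\x84"), "\n";    # two raw bytes (UTF-8 encoding of U+00C4)
```

Because the output contains only ASCII, it looks the same on every terminal, no matter how that terminal interprets bytes above 0x7F.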


Alexander


--
Alexander Foken
mailto:alexan...@foken.de  http://www.foken.de/alexander/
