Hello again

On 2010-05-13, at 24:19, Marc Mims wrote:
>> I disagree with that. Consider this:
>>
>> my $string = "\x{e4}\x{f6}\x{fc}";
>> utf8::upgrade($string);
>>
>> my $other_string = "\x{e4}\x{f6}\x{fc}";
>>
>> ok($string eq $other_string, "upgraded and not upgraded character strings are equal");
>>
>> Both $string and $other_string are perfectly valid Perl character strings, and
>> they are equal. How Perl holds them internally doesn't and shouldn't matter.
>
> Unfortunately, it does matter. Perl supports 2 types of strings: byte
> strings and unicode strings. For legacy reasons, byte-strings are
> interpreted as latin-1. In your example, $string (after the
> utf8::upgrade) is a unicode string. $other_string is not. DBD::mysql
> with mysql_enable_utf8 will be happy with $string but apparently isn't
> happy with $other_string.

The important point is not that byte-strings are interpreted as Latin-1 in
some cases, but that Perl tries to keep its data as eight-bit bytes for as
long as possible! From perluniintro [1]:

> Perl supports both pre-5.6 strings of eight-bit native bytes, and strings of
> Unicode characters. The principle is that Perl tries to keep its data as
> eight-bit bytes for as long as possible, but as soon as Unicodeness cannot be
> avoided, the data is (mostly) transparently upgraded to Unicode. There are
> some problems--see "The Unicode Bug" in perlunicode.
>
> Internally, Perl currently uses either whatever the native eight-bit
> character set of the platform (for example Latin-1) is, defaulting to UTF-8,
> to encode Unicode strings. Specifically, if all code points in the string are
> 0xFF or less, Perl uses the native eight-bit character set. Otherwise, it
> uses UTF-8.
>
> A user of Perl does not normally need to know nor care how Perl happens to
> encode its internal strings, but it becomes relevant when outputting Unicode
> strings to a stream without a PerlIO layer (one with the "default" encoding).
> In such a case, the raw bytes used internally (the native character set or
> UTF-8, as appropriate for each string) will be used, and a "Wide character"
> warning will be issued if those strings contain a character beyond 0x00FF.

Note the first sentence of the third paragraph: "A user of Perl does not
normally need to know nor care how Perl happens to encode its internal
strings"! I really recommend reading the whole chapter and having a look at
the examples!

The problem arises because DBD::mysql sends data out without using a PerlIO
layer; otherwise there would be no problem with $other_string! Again, a user
should never have to mess around with Perl internals such as utf8::upgrade().

I repeat: it is not correct to skip encoding your data if you want to send
UTF-8! Suppose a UTF-8 terminal and the following example:

    my $var = "\x{fc}bercool \x{263a}";
    print $var,"\n";

Perl will issue the warning "Wide character in print at -e line 1." although
everything looks fine in the terminal. The correct way in this situation would
be something like this:

    my $var = "\x{fc}bercool \x{263a}";
    binmode(STDOUT, ":encoding(UTF-8)");   # let a PerlIO layer do the encoding
    print $var,"\n";

or

    use Encode;                            # provides Encode::encode
    my $var = "\x{fc}bercool \x{263a}";
    print Encode::encode("UTF-8", $var),"\n";

Everything works as expected and there are no warnings.

Consider the following:

    my $a_string = "\x{fc}bercool :-)"; # "übercool :-)"
    my $another_string = "\x{fc}bercool \x{263a}"; # "übercool ☺"

The only difference is that $another_string contains a character above 0xFF,
so Perl holds it internally as UTF-8, while $a_string stays in the native
eight-bit character set. Does it really make sense that a library does not
work as expected with $a_string, but does so with $another_string? I think it
doesn't! Using the \x{...} notation is absolutely okay; it only has
implications if a library you use doesn't respect Perl's Unicode model!

> I was just trying to be helpful. Like I said, I'm no unicode expert.

Thank you very much. I'm trying to be helpful too by pointing at an existing
and real problem.

Regards
Matias E. Fernandez

[1] http://perldoc.perl.org/5.12.0/perluniintro.html#Perl's-Unicode-Model

_______________________________________________
List: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/dbix-class
IRC: irc.perl.org#dbix-class
SVN: http://dev.catalyst.perl.org/repos/bast/DBIx-Class/
Searchable Archive: http://www.grokbase.com/group/dbix-class@lists.scsys.co.uk
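
For anyone who wants to check the argument in this message, here is a minimal sketch. It assumes only core Perl (the Encode module plus the always-available utf8:: functions); the variable names are illustrative and not taken from any library. It shows that an upgraded and a non-upgraded string are equal as characters, differ only in their internal flag, and produce identical octets once they are explicitly encoded.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two copies of the same three characters, all code points below 0x100.
my $native   = "\x{e4}\x{f6}\x{fc}";
my $upgraded = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($upgraded);    # changes only the internal representation

# Same characters, different internal flag.
my $same_chars    = $native eq $upgraded      ? 1 : 0;
my $flag_native   = utf8::is_utf8($native)    ? 1 : 0;
my $flag_upgraded = utf8::is_utf8($upgraded)  ? 1 : 0;

# Explicit encoding yields the same UTF-8 octets for both strings.
my $bytes_native   = encode("UTF-8", $native);
my $bytes_upgraded = encode("UTF-8", $upgraded);
my $same_bytes     = $bytes_native eq $bytes_upgraded ? 1 : 0;

printf "equal as characters: %d\n", $same_chars;
printf "internal UTF-8 flag: native=%d upgraded=%d\n",
    $flag_native, $flag_upgraded;
printf "equal after encode:  %d (%d octets each)\n",
    $same_bytes, length $bytes_native;
```

Code that encodes at the output boundary, as in this sketch, never has to look at the internal flag, which is the behaviour argued for above.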