* Matias E. Fernandez <pi...@gmx.ch> [100512 07:43]:
> On 2010-05-12, at 15:40, Marc Mims wrote:
>
> >> my $title = "\x{e4}\x{f6}\x{fc}"; # "äöü"
> >
> > This isn't a UTF-8 string.
> >
> > utf8::is_utf8($title); # false
> >
> > utf8::upgrade($title); # now it is
>
> It is a string consisting of the three characters \x{e4}, \x{f6} and \x{fc}.
> That's about all I have to know as a Perl user, reread [1] if in doubt. The
> important thing to know is that you cannot rely on Perl internally holding
> strings in UTF-8! Of course I could force Perl to internally hold this string
> in UTF-8 by using utf8::upgrade(), but the question is: where should I do
> that so as to cover all cases? As pointed out in [2], overwriting get_columns
> and store_columns won't work reliably. That's why I suggested using the
> inflate/deflate subroutines, but will this work in all cases? Even then it
> would be a bad idea to use utf8::upgrade() because that's not what it's meant
> for. As pointed out in [3] the flow should be as follows:

No. It's a string consisting of 3 bytes that happen to be latin-1
characters. If you're going to feed them to a module that expects UTF-8,
you need to make them UTF-8 first.

> > 1. Receive and decode
> > 2. Process
> > 3. Encode and output

Correct. The byte string in $title hasn't been decoded yet.

I'm not a Unicode expert. I've struggled mightily with it and seem to have
eventually got it right in Net::Twitter. And I've had no trouble with
DBD::mysql with mysql_enable_utf8 set and either the default encoding or
specific columns set to utf8 in the DDL. But in order for it to work, you
have to send it decoded utf8 strings, not latin-1 strings.

You're quite correct that you shouldn't have to worry about what the
internal representation is in perl. As long as you decode input and encode
output, you should be good.

With DBD::mysql/mysql_enable_utf8, you send it decoded utf8 and you get
back decoded utf8. It takes care of the decoding and encoding on
input/output for you.

> and as a matter of fact, neither DBIx::Class nor DBD::mysql do the 3rd step
> (encoding to UTF-8), because then the problem would not arise. Look at this:
>
> my $title = "\x{e4}\x{f6}\x{fc}";
> return Encode::encode('UTF-8', $title);
>
> and
>
> my $other_title = "\x{e4}\x{f6}\x{fc}";
> utf8::upgrade($other_title);
> return Encode::encode('UTF-8', $other_title);
>
> Both yield the same result. Using utf8::upgrade() here is useless, and again:
> as pointed out in [1] you shouldn't care about the internal format.

Perl itself understands that $title is latin-1, and when you encode it to
utf8, it does the right thing. DBD::mysql isn't quite as smart. It expects
a decoded utf8 string, so the utf8::upgrade is necessary.

> My question remains: is deflate/inflate a safe place to do encoding, or will
> it suffer the same flaws as DBIx::Class::UTF8Columns?

I don't think deflate/inflate is the correct place. Those hooks are for
serializing and de-serializing objects.
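
For what it's worth, the decode/process/encode flow and the two encode
snippets quoted above can be checked with core Perl and Encode alone. This
is only a minimal sketch: the variable names and the sample input bytes are
made up for illustration, and it says nothing about how DBD::mysql treats
either string.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three characters given by code point: U+00E4, U+00F6, U+00FC.
my $title = "\x{e4}\x{f6}\x{fc}";

# A copy that has been upgraded: same characters, different internal storage.
my $upgraded = $title;
utf8::upgrade($upgraded);

print "same characters: ", ($title eq $upgraded ? "yes" : "no"), "\n";
print "is_utf8 plain:    ", (utf8::is_utf8($title)    ? 1 : 0), "\n";
print "is_utf8 upgraded: ", (utf8::is_utf8($upgraded) ? 1 : 0), "\n";

# Step 3 of the flow: encoding either copy gives the same UTF-8 octets.
my $bytes_a = encode('UTF-8', $title);
my $bytes_b = encode('UTF-8', $upgraded);
print "same octets: ", ($bytes_a eq $bytes_b ? "yes" : "no"), "\n";
print "octet count: ", length($bytes_a), "\n";

# Step 1 of the flow: hypothetical raw input, the UTF-8 octets for the
# same three characters, decoded into a character string.
my $input   = "\xc3\xa4\xc3\xb6\xc3\xbc";
my $decoded = decode('UTF-8', $input);

# Step 2: process characters, not octets.
print "characters after decode: ", length($decoded), "\n";
```

The comparison with eq works on characters, so it does not depend on which
internal representation perl happens to use for either operand.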

If you're using DBD::mysql, you can simply use the mysql_enable_utf8 flag
and you won't need DBIx::Class::UTF8Columns [1].

[1] http://search.cpan.org/~frew/DBIx-Class-0.08121/lib/DBIx/Class/UTF8Columns.pm#Warning_-_Native_Database_Unicode_Support

-Marc