On Mon, Jul 05, 2010 at 11:02:02PM +0200, Matias E. Fernandez wrote:
> Hello Jesse
>
> I'm pretty sure your data has been UTF-8 encoded twice. Consider this example:
>
> use strict;
> use warnings;
>
> use Encode;
>
> # $string is UTF-8, but Perl doesn't know
> my $string = 'Pérez-Reverte, Arturo Кири́ллица ქართული 汉字 / 漢';
> # $double_utf8 contains the double UTF-8 encoded string
> # note that this is an implicit ISO-8859-1 to UTF-8 conversion
> my $double_utf8 = Encode::encode('UTF-8', $string);
>
> print "double encoded UTF-8:\n", "$double_utf8\n\n";
>
> # let Perl believe that $double_utf8 is UTF-8
> Encode::_utf8_on($double_utf8);
> # run $double_utf8 through a UTF-8 to ISO-8859-1 conversion
> my $double_utf8_to_latin1 = Encode::decode('ISO-8859-1', $double_utf8);
>
> print "double UTF-8 to ISO-8859-1:\n", "$double_utf8_to_latin1\n\n";

Right, that looks "correct". But this is latin1, not UTF-8, so...

> So why is your data in the database double encoded UTF-8?
> The problem is that you're not using the mysql_enable_utf8
> option (see the DBD::mysql documentation). If you don't use
> that option as part of the call to 'connect()', DBD::mysql
> will configure the connection in a way that MySQL
> believes it's being sent ISO-8859-1. Because your table is
> configured to store character data as UTF-8, MySQL converts
> the received data from ISO-8859-1 to UTF-8. There you have
> double encoded UTF-8!

I am using it now, but there was a point when I wasn't, or these
tables were first set up as latin-1, or there was some other screwup.
The problem is that the tables do exist now.

> The solution is simply to use mysql_enable_utf8 as part of
> the call to 'connect()'. If you're using DBIx::Class I
> recommend also disabling the mysql_auto_reconnect option,
> this will save you a lot of headache.

But that doesn't help me right now; it only helps me for the
future. That is, I currently have data in the database, some of which
is double-encoded UTF-8. If I try to retrieve this, setting
mysql_enable_utf8 doesn't help. That is, if I take my existing data
(e.g. the example I originally posted), connect to MySQL with
mysql_enable_utf8, and pull the data with a Perl script, I still get
junk.

In your above example you show how to un-double-encode the data I
have, but only by turning it into latin1, right? How do I take my
existing data and turn it into proper UTF-8, at which point I can
make sure everything is set correctly so that I never have this
problem again?

Thanks for looking at this so closely.

Jesse Sheidlower

_______________________________________________
List: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/dbix-class
IRC: irc.perl.org#dbix-class
SVN: http://dev.catalyst.perl.org/repos/bast/DBIx-Class/
Searchable Archive: http://www.grokbase.com/group/dbix-class@lists.scsys.co.uk
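
For reference, here is a minimal sketch of one way the "turn it into
proper UTF-8" step might look in Perl. It assumes the extra layer came
from a plain ISO-8859-1 to UTF-8 conversion, as described in the quoted
message, and that the input is the raw octets exactly as stored.
fix_double_utf8 is a hypothetical helper name, not part of Encode or
DBD::mysql. Values that do not look double encoded are returned
unchanged.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: undo ONE layer of UTF-8 double encoding.
# Takes octets; returns octets of single-encoded UTF-8, or the
# input unchanged if it does not look double encoded.
sub fix_double_utf8 {
    my ($octets) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    # Layer 1: decode the stored octets into characters.
    my $chars = eval { decode('UTF-8', $octets, $check) };
    return $octets unless defined $chars;

    # Double-encoded data decodes to characters 0x00-0xFF only,
    # each one standing for a byte of the original UTF-8.
    return $octets if $chars =~ /[^\x00-\xFF]/;

    # Map those characters back to the bytes they stand for.
    my $bytes = encode('ISO-8859-1', $chars);

    # Accept the result only if it is itself valid UTF-8; otherwise
    # the input was ordinary single-encoded text such as "é".
    my $ok = eval { decode('UTF-8', $bytes, $check); 1 };
    return $ok ? $bytes : $octets;
}

# Usage: build a double-encoded value the same way as in the quoted
# example, then repair it.
my $good   = "P\xC3\xA9rez";          # "Pérez" as UTF-8 octets
my $double = encode('UTF-8', $good);  # octets treated as Latin-1
print fix_double_utf8($double) eq $good ? "repaired\n" : "unchanged\n";
```

The second validity check is what makes this safe to run over a mix of
good and bad rows under the stated assumption. It would not handle data
that went through a different conversion (for example a Windows-1252
style mapping), which would need its own byte mapping.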