Maybe you're already aware of this, but I've found from experience that troubleshooting encoding/Unicode problems in a web/db app can be difficult, especially with multiple conversions at different stages. I've come up with a short generic algorithm to help test that things are working and to find where things need fixing. Note that these details assume we're using Perl 5.8+.

1. Make sure all your text/code/template/non-binary/etc files are saved as UTF-8 text files (or they are 7-bit ASCII), and you have a Unicode-savvy text editor.

2. Have a "use utf8;" at the top of every Perl file, so Perl treats your source files as being UTF-8 encoded and reads non-ASCII string literals as character strings.

3. Place a text string literal in your program code that you know isn't in ASCII ... for example I like to use the word 'サンプル', which is what came out of Google's translation tool when I asked it to translate the word 'sample' to Japanese. Then set up your program to display that text directly in your web page text, without any escaping.

4. Make sure the HTTP response headers for the web page with that text have a Content-Type charset value of UTF-8, and make sure that Perl is encoding its output as actual UTF-8. If you were printing directly to STDOUT, for example in a CGI script, it could be: "binmode *main::STDOUT, ':encoding(UTF-8)';" or such. Make sure your web browser is Unicode savvy.

5. At this point, if the web page displays correctly with the non-ASCII literal (and moreover, if you "view source" in the browser and the literal also displays literally), then you know your program can represent Unicode correctly internally, and it can output Unicode correctly to the browser. It is very important to get this step working first, in isolation, so that you are in a position to judge or troubleshoot other issues such as receiving Unicode input from a browser or using it with a database.

6. Next test that you can receive Unicode from the browser in the various ways, whether by query string / HTTP headers or in an HTTP POST. E.g. try outputting a value and have the user submit it again, and compare for equality either in the Perl program or by displaying it again next to the original for visual inspection. If any differences come up, then you know any fixes you have to do concern either how you read and interpret the browser request, or perhaps how you instruct the browser to submit a request. Once that's all cleared up, then you know your I/O with the web browser works fine.

7. To test a database, I suggest first using a known-good and Unicode savvy alternate input method for putting some Unicode text in the database, such as using an admin/utility tool that came with the DBMS. Also make sure that the database is itself using UTF-8 character strings in its schema, e.g. that the schema is declared this way.

8. With a database known to contain some valid Unicode text, first test simply selecting that text from the database and displaying it. If anything doesn't match, it means you probably have to configure your DBMS client connection encoding so it is UTF-8 (often done with a few SQL commands), and then separately ensure that Perl is decoding the UTF-8 data into Perl text strings properly. It's important to make sure you can retrieve Unicode from the database properly so that you have a context for judging whether you can insert such text in the database.

9. Next try to insert some Unicode text in the database using your program, then select it back to check that it worked. If it didn't, then check DBMS client connection settings, or that Perl is encoding text as UTF-8 properly.

10. Actually, when you have a known-good external tool to help you, you can alternately start the DBMS tests with step 9, where your program inserts text, then you use the known-good tool to ensure it actually was recorded properly.
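
As a rough illustration of steps 2 to 6, here is a minimal sketch in plain Perl using only core modules. The helper names code_points and roundtrip_ok are made up for this example, and the "browser" is simulated by encoding the sample text to UTF-8 bytes, so treat it as a sketch of the idea rather than a drop-in for any particular framework.

```perl
use strict;
use warnings;
use utf8;                          # step 2: this source file is UTF-8
use Encode qw(decode encode);

# Step 3: a literal that is definitely not ASCII.
my $sample = 'サンプル';

# Hypothetical helper: list a string's code points, so that a decoded
# character string and raw UTF-8 bytes are easy to tell apart.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $str;
}

# Hypothetical helper for step 6: decode the bytes that came back from
# the browser and compare them with the text originally sent out.
sub roundtrip_ok {
    my ($original, $bytes_from_browser) = @_;
    my $decoded = decode('UTF-8', $bytes_from_browser);
    return $decoded eq $original;
}

# Step 4: declare the charset and encode output as actual UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "Content-Type: text/html; charset=UTF-8\n\n";
print "<p>$sample</p>\n";

# Step 6, simulated: pretend the browser submitted the same text back
# as UTF-8 bytes (an assumption made only for this sketch).
my $submitted = encode('UTF-8', $sample);
print roundtrip_ok($sample, $submitted) ? "match\n" : "MISMATCH\n";

# When a comparison fails, dumping both sides shows what went wrong:
print code_points($submitted), "\n";   # one entry per byte (not decoded)
print code_points($sample), "\n";      # one entry per character
```

If a value such as "Jes\xc3\xbas" shows up in a dump, code_points would list U+00C3 U+00BA as two separate entries, which suggests the string is still raw UTF-8 bytes; after decode('UTF-8', ...) those two become the single character U+00FA.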

Anyway, that's it in a nutshell. Now I'm sure many of you have already figured this out, but for those who haven't, I hope these tips help you. Adjust as appropriate to account for any abstraction tools or frameworks you are using, which may mean your tests also involve testing those tools or configuring them.
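
For the database steps (7 to 9), the insert-then-select check could be sketched like this. It assumes DBI and DBD::SQLite are installed and uses an in-memory SQLite database purely as a stand-in for your real DBMS; the table name unicode_test and the helper db_roundtrip_ok are hypothetical. The sqlite_unicode attribute is specific to DBD::SQLite; other drivers have their own setting for the same purpose (for DBD::mysql, I believe it is mysql_enable_utf8), so check your driver's documentation.

```perl
use strict;
use warnings;
use utf8;
use DBI;

# Hypothetical helper: insert $text, select it back, and report whether
# it came back as the same Perl character string.
sub db_roundtrip_ok {
    my ($dbh, $text) = @_;
    $dbh->do('CREATE TABLE IF NOT EXISTS unicode_test (val TEXT)');
    $dbh->do('DELETE FROM unicode_test');
    $dbh->do('INSERT INTO unicode_test (val) VALUES (?)', undef, $text);
    my ($got) = $dbh->selectrow_array('SELECT val FROM unicode_test');
    return defined $got && $got eq $text;
}

# Assumption: an in-memory SQLite database standing in for the real one.
# sqlite_unicode asks the driver to encode/decode strings as UTF-8.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
    { RaiseError => 1, sqlite_unicode => 1 });

binmode STDOUT, ':encoding(UTF-8)';
print db_roundtrip_ok($dbh, 'サンプル')
    ? "database round trip ok\n"
    : "database round trip FAILED\n";
```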

-- Darren Duncan

Hugh Hunter wrote:
I've been struggling with this for some time and know there must be an answer out there.

I'm using URL arguments to pass parameters to my controller. It's a site about names, so take the url http://domain.com/name/Jesús (note the accented u). The Name.pm controller has an :Args(1) decorator so Jesús is stored in $name and then passed to my DBIC model in a ->search({name => $name}) call. This doesn't manage to find the row that exists in mysql. When I dump $name I get:

'name' => 'Jes\xc3\xbas'

which I think I understand as being perl's internal escaping of utf-8 characters.

I've done everything recommended on http://dev.catalystframework.org/wiki/gettingstarted/tutorialsandhowtos/using_unicode and the name column in my mysql database uses the utf-8 charset.

Where am I going wrong?


_______________________________________________
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/
