[Catalyst] tips for troubleshooting/QAing Unicode (was Re: Passing UTF-8 arg in URL to DBIC search)

Darren Duncan Sat, 27 Sep 2008 15:40:17 -0700

Maybe you're already aware of this, but I've found from experience that troubleshooting encoding/Unicode problems in a web/db app can be difficult, especially with multiple conversions at different stages, so I've come up with a short generic algorithm to help test that things are working and to find where things need fixing. Note that these details assume we're using Perl 5.8+.

1. Make sure all your text/code/template/non-binary/etc files are saved as UTF-8 text files (or they are 7-bit ASCII), and you have a Unicode-savvy text editor.

2. Have a "use utf8;" at the top of every Perl file, so Perl treats your source files as being Unicode.

3. Place a text string literal in your program code that you know isn't in ASCII ... for example I like to use the word 'サンプル', which is what came out of Google's translation tool when I asked it to translate the word 'sample' to Japanese. Then set up your program to display that text directly in your web page text, without any escaping.

4. Make sure the HTTP response headers for the webpage with that text have a content-type charset value of UTF-8, and make sure that Perl is encoding its output as actual UTF-8; if you were writing directly to STDOUT, for example in a CGI, it could be: "binmode *main::STDOUT, ':encoding(UTF-8)';" or such. Make sure your web browser is Unicode savvy.

5. At this point, if the web page displays correctly with the non-ASCII literal (and moreover, if you "view source" in the browser and the literal also displays literally), then you know your program can represent and work with Unicode correctly internally, and it can output Unicode correctly to the browser. It is very important to get this step working first, in isolation, so that you are in a position to judge or troubleshoot other issues such as receiving Unicode input from a browser or using it with a database.

6. Next test that you can receive Unicode from the browser in the various ways, whether by query string / HTTP headers or in an HTTP POST. E.g., try outputting a value and have the user submit it again, and compare for equality either in the Perl program or by displaying it again next to the original for visual inspection. If any differences come up, then you know any fixes you have to do concern either how you read and interpret the browser request, or perhaps how you instruct the browser to submit a request. Once that's all cleared up, then you know your I/O with the web browser works fine.

7. To test a database, I suggest first using a known-good and Unicode-savvy alternate input method for putting some Unicode text in the database, such as using an admin/utility tool that came with the DBMS. Also make sure that the database is itself using UTF-8 character strings in its schema, e.g. that the schema is declared this way.

8. With a database known to contain some valid Unicode text, you first test simply selecting that text from the database and displaying it. If anything doesn't match, it means you probably have to configure your DBMS client connection encoding so it is UTF-8 (often done with a few certain SQL commands), and then separately ensure that Perl is decoding the UTF-8 data into Perl text strings properly. It's important to make sure you can retrieve Unicode from the database properly so that you have a context for judging that you can insert such text in the database.

9. Next try to insert some Unicode text in the database using your program, then select it back to check that it worked. If it didn't, then check DBMS client connection settings, or that Perl is encoding text as UTF-8 properly.

10. Actually, when you have a known-good external tool to help you, you can alternately start the DBMS tests with step 9, where your program inserts text, and then use the known-good tool to ensure it actually was recorded properly.
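
As a rough illustration of steps 2 through 6, here is a minimal sketch using only the core Encode module. It assumes a plain script rather than any particular framework; the helper name decode_url_segment and the percent-escaped value 'Jes%C3%BAs' are made up for this example (the latter is what a browser would typically send for the name in the quoted message below), and a real app would normally let its framework or a URI module do the unescaping.

```perl
use strict;
use warnings;
use utf8;                          # step 2: this source file is UTF-8
use Encode qw(encode decode);

# Step 3: a literal that is definitely not ASCII.
my $sample = 'サンプル';

# Step 4: characters become UTF-8 bytes only at the output boundary.
my $wire_bytes = encode('UTF-8', $sample);

# Step 6: bytes coming back in are decoded to characters at the input
# boundary. FB_CROAK makes malformed input die instead of being silently
# patched up; LEAVE_SRC keeps decode() from modifying its argument.
my $received = decode('UTF-8', $wire_bytes,
                      Encode::FB_CROAK() | Encode::LEAVE_SRC());

# Hypothetical helper: undo %XX escaping by hand, then decode the bytes.
sub decode_url_segment {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes,
                  Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

# Same layer as in step 4, so printing wide characters is safe.
binmode *main::STDOUT, ':encoding(UTF-8)';
print "round trip ok\n" if $received eq $sample;
print decode_url_segment('Jes%C3%BAs'), "\n";
```

The point of the sketch is only that every boundary has exactly one encode or decode; if a value passes through two of them, or none, the comparison in step 6 is where it should show up.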

Anyway, that's it in a nutshell. Now I'm sure many of you have already figured this out, but for those who haven't, I hope these tips help you. Adjust as appropriate to account for any abstraction tools or frameworks you are using, which may mean your tests also involve testing or configuring those tools.
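
When one of the comparisons in steps 6, 8 or 9 fails, I find it easier to look at code points than at rendered glyphs. The following is a small sketch of that idea, not part of any module: code_points and looks_like_undecoded_utf8 are hypothetical helper names, and the second one is only a heuristic guess (a short Latin-1 string could in principle also happen to be valid UTF-8).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Render a string as its code points so two values can be compared exactly.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $str);
}

# Heuristic: a string with no characters above 0xFF, at least one in the
# range 0x80..0xFF, and which is itself valid UTF-8 was probably never
# decoded and is still raw bytes.
sub looks_like_undecoded_utf8 {
    my ($str) = @_;
    return 0 if $str =~ /[^\x00-\xFF]/;      # wide characters: already text
    return 0 unless $str =~ /[\x80-\xFF]/;   # pure ASCII: nothing to tell
    my $copy = $str;                         # decode() may modify its input
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# A value shaped like the dump in the quoted message below.
my $name = "Jes\xc3\xbas";
print code_points($name), "\n";
print "probably still UTF-8 bytes\n" if looks_like_undecoded_utf8($name);
```

If the value from the request looks like raw bytes while the value you compare against is decoded text (or the other way around), the two will not compare equal even though they print the same on a terminal.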


-- Darren Duncan

Hugh Hunter wrote:

I've been struggling with this for some time and know there must be an answer out there.
I'm using URL arguments to pass parameters to my controller. It's a site about names, so take the url http://domain.com/name/Jesús (note the accented u). The Name.pm controller has an :Args(1) decorator so Jesús is stored in $name and then passed to my DBIC model in a ->search({name => $name}) call. This doesn't manage to find the row that exists in mysql. When I dump $name I get:
'name' => 'Jes\xc3\xbas'
which I think I understand as being perl's internal escaping of utf-8 characters.
I've done everything recommended on http://dev.catalystframework.org/wiki/gettingstarted/tutorialsandhowtos/using_unicode and the name column in my mysql database uses the utf-8 charset.
Where am I going wrong?



_______________________________________________
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/
