* On Sat, Sep 27 2008, Darren Duncan wrote:
> Maybe you're already aware of this, but I've found from experience
> that troubleshooting encoding/Unicode problems in a web/db app can be
> difficult, especially with multiple conversions at different stages,
> but I've come up with a short generic algorithm to help test/ensure
> that things are working and where things need fixing.

A simplified version:

1) Identify sources of input to your application

2) Ensure that you called Encode::decode('the-character-encoding', ...)
   on all that data.

   If you are dealing with pure ASCII, I guess you can skip this step.
   Encode::decode('us-ascii', ...) probably works though.

   Sometimes libraries will do this for you, but don't count on it,
   verify it. If you don't see the code doing it, it's not being done.

   Note that the existence of the "UTF-8 flag" does not tell you whether
   this is being correctly done. Your program can be perfectly
   Unicode-clean and never have a string with the UTF-8 flag on.

   If you see stuff like utf8::encode and utf8::decode or
   Encode::_utf8_on and so on, your program is horribly broken. Use
   Encode properly before continuing.

   Finally, keep in mind that there are odd sources of data. Hash keys
   from config files, file names, file extended attributes, form params,
   form field names, URIs(*), etc.

   (*) handle these manually, there is no mention of Unicode in the URI
   standard. Some people do things like put Japanese text in the HTTP
   headers. This is not allowed. ASCII only.

3) Identify where you output text.

4) Ensure that you called Encode::encode('output-character-encoding', ...)
   on any data that leaves your program.

   In the case of dealing with external applications, make sure that
   you've told them what the output character encoding is. Databases
   have flags for this, HTTP has the Content-type header, etc.

5) You're done.

I have found that Devel::StringInfo is very helpful; you can have it
dump the information when you are inputting data... it will make it
clear when you have bytes instead of characters.

Be sure to test with all sorts of input -- I always use characters from
ASCII ("foo"), Latin ("ÿ"), and Japanese ("ほげ"). If your app gets
those three right, it is probably OK.

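As a rough illustration of steps 2 and 4, here is a minimal sketch that decodes at the input boundary, encodes at the output boundary, and runs the three test strings through both. It assumes UTF-8 in both directions, which may not match your app. The helper names from_outside and to_outside are made up for this example and are not part of Encode or Catalyst.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Die on malformed data rather than silently substituting, and leave the
# source string untouched (Encode may modify it in place when CHECK is set).
my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

# Hypothetical boundary helpers. The encoding name is an assumption; use
# whatever your input and output really are.
sub from_outside { my ($bytes) = @_; return decode('UTF-8', $bytes, $check) }
sub to_outside   { my ($chars) = @_; return encode('UTF-8', $chars, $check) }

# The three test inputs, written as the UTF-8 bytes a client would send:
# ASCII "foo", Latin "ÿ" (U+00FF), and Japanese "ほげ" (U+307B U+3052).
my @cases = (
    "foo",
    "\xC3\xBF",
    "\xE3\x81\xBB\xE3\x81\x92",
);

my @results;
for my $bytes (@cases) {
    my $text = from_outside($bytes);   # step 2: bytes -> characters
    my $out  = to_outside($text);      # step 4: characters -> bytes
    push @results, {
        bytes_in   => length($bytes),
        chars      => length($text),
        round_trip => ($out eq $bytes ? 1 : 0),
    };
    printf "%d bytes in, %d characters, round trip %s\n",
        length($bytes), length($text), ($out eq $bytes ? "ok" : "BROKEN");
}

# Malformed input should fail loudly at the boundary, not deep in the app.
my $rejected = eval { from_outside("\xFF"); 1 } ? 0 : 1;
```

If the character count equals the byte count for the Latin or Japanese case, the data was never decoded and you are still holding bytes.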
Regards,
Jonathan Rockway

--
print just => another => perl => hacker => if $,=$"

_______________________________________________
List: Catalyst@lists.scsys.co.uk
Listinfo: http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
Searchable archive: http://www.mail-archive.com/catalyst@lists.scsys.co.uk/
Dev site: http://dev.catalyst.perl.org/