mod_perl, html forms and unicode/utf-8
Hi folks,

I've been dragged kicking and screaming into the 21st century and am making my mod_perl application fully UTF-8 aware and transparent. It's all going OK, but I'd like to know if anyone has a better way of receiving form data containing non-ASCII chars.

Output is fine - I can override any Apache settings with:

    $r->content_type('text/html; charset=utf-8');

The puzzling bit was getting UTF-8 in and out of forms without mangling it. I have it working, and here's what I do:

HTML: (the page is specifically set to be utf-8)

Perl:

    use Encode;
    use Apache2::Request ();

    sub handler {
        my $r = shift;
        my $q = Apache2::Request->new($r);

        # form post doesn't give charset, none assumed
        my $known_to_be_utf8 = $q->param('test');

        # turn the raw UTF-8 bytes into a Perl character string
        my $utf8_aware_string = decode_utf8( $known_to_be_utf8 );
        ..
        # the above works (we get our data back in one piece)
        # and of course the HTML entities have been turned into UTF-8 chars
    }

I tried some form attributes:

  enctype="multipart/form-data" - this doesn't specify a charset in the content-type headers (tried IE6 and FF)
  accept-charset="utf-8" - no change for me (as no charset transformation required)

So there's no way for the server to know what charset the parameters are in; the application has to know what to expect.

Any thoughts?

cheers
John
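Since the application has to assume the charset, it may be worth decoding strictly so that input which is not valid UTF-8 is detected rather than silently mangled. The following is a minimal sketch of that idea, not part of the original handler: `decode_form_value` is a hypothetical helper name, and the byte strings stand in for what `$q->param('test')` would return.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: turn raw form bytes into a Perl character string.
# Returns undef when the value is missing or is not valid UTF-8.
sub decode_form_value {
    my ($bytes) = @_;
    return undef unless defined $bytes;
    # FB_CROAK makes decode() die on malformed input; eval traps that.
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $chars;
}

# "caf\xc3\xa9" is the UTF-8 byte sequence for "café".
my $ok  = decode_form_value("caf\xc3\xa9");
# "\xe9" is Latin-1 for "é", which is not valid UTF-8 on its own.
my $bad = decode_form_value("caf\xe9");

printf "%d characters\n", length $ok;    # prints "4 characters"
print defined $bad ? "decoded\n" : "rejected: not valid UTF-8\n";
```

The difference from plain `decode_utf8` is the explicit check argument: a browser that submits Latin-1 despite the page's charset then shows up as a rejected value instead of a string with replacement characters in it.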
Re: mod_perl, html forms and unicode/utf-8
On Jun 5, 2007, at 12:56 PM, John ORourke wrote:

> my $q=Apache2::Request->new($r);
> my $known_to_be_utf8 = $q->param('test'); # form post doesn't give charset, none assumed

Slightly off topic, my suggestion on implementation would be along the lines of this:

    package Context;
    sub new        {}
    sub getPost    {}
    sub getGet     {}
    sub getGetPost {}
    1;

where you have a Context class. Instead of calling $q->param, you call $ctx->getPost.

Context does two things:

a- it encapsulates all of the charset crap, so you only write it once.
b- it gives you uniform access to a few Apache::Request and CGI functions. This way you can switch between the two as needed.

I switched to that a few weeks ago. I think Randy has a CPAN module that is similar. It's a damn lifesaver though. I took a negligible performance hit in wrapping all my get/post calls in that, but all of my data sanitization routines are right there in that module (strip whitespace, strip charsets I don't want, etc).

(I actually double wrapped it: class Context gives me a wrapper, and class MyApp::Context subclasses Context and overrides getPost with custom sanitization routines per app.)
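One way the Context idea could look in practice is sketched below. This is an illustration under assumptions, not Jonathan's actual module: only `getPost` is filled in, it simply calls `param()` on whatever object it wraps (so it does not distinguish GET from POST), and `FakeRequest` is a hypothetical stand-in so the example runs outside mod_perl.

```perl
use strict;
use warnings;

package Context;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# $source is assumed to be any object with a param($name) method that
# returns raw bytes, e.g. an Apache2::Request or CGI object.
sub new {
    my ($class, $source) = @_;
    return bless { source => $source }, $class;
}

# Fetch one parameter, decode it from UTF-8, and strip surrounding
# whitespace. Returns undef if the value is missing or not valid UTF-8.
sub getPost {
    my ($self, $name) = @_;
    my $bytes = $self->{source}->param($name);
    return undef unless defined $bytes;
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return undef unless defined $chars;
    $chars =~ s/^\s+|\s+$//g;
    return $chars;
}

package FakeRequest;    # hypothetical stand-in for a real request object
sub new   { my ($class, %params) = @_; return bless {%params}, $class }
sub param { my ($self, $name) = @_; return $self->{$name} }

package main;

# "na\xc3\xafve" is the UTF-8 byte sequence for "naïve".
my $ctx   = Context->new( FakeRequest->new( test => "  na\xc3\xafve  " ) );
my $value = $ctx->getPost('test');
printf "%d characters\n", length $value;    # prints "5 characters"
```

A per-application subclass could then override `getPost` to add its own sanitization while the decoding stays in one place.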
Re: mod_perl, html forms and unicode/utf-8
Hi John

I've been using libapreq, which has a charset method:

http://search.cpan.org/~joesuf/libapreq2-2.08/glue/perl/xsbuilder/APR/Request/Param/Param.pod#charset

It is fairly limited; it recognises:

  0  APREQ_CHARSET_ASCII   (7-bit us-ascii)
  1  APREQ_CHARSET_LATIN1  (8-bit iso-8859-1)
  2  APREQ_CHARSET_CP1252  (8-bit Windows-1252)
  8  APREQ_CHARSET_UTF8    (utf8 encoded Unicode)

but this has been working fine for me on IE 6, 7, Firefox and Opera. I think (not sure) that these more modern browsers do try to submit form data in the character set of the web page. It hasn't been tested to the point that I am certain that it works every time, but I've had no problems with it over the last year of use.

Don't forget the other part, which is that, if you put UTF8 into the database, you may need to reset the UTF8 flag when you get the data back again. The new DBD::MySQL driver has added this automatically, but I haven't tried it - I've been using my own wrapper on an older driver which I know works. Not sure about other drivers, but (again) I "think" there is reasonable support for UTF8 on the more popular ones.

Once you're happy with the fact that the data coming in and out of your system is UTF8, it makes life a lot easier. Things like filtering input data with \w just work.

good luck
Clint
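To illustrate the point about the UTF8 flag and \w, here is a small sketch using only the core Encode module. The byte string is an assumption standing in for a value read back from a form or from a database driver that returns undecoded bytes; no real driver is involved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "été", as they might arrive undecoded.
my $bytes = "\xc3\xa9t\xc3\xa9";
# Decoding yields a character string with Perl's UTF8 flag set.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length $bytes, length $chars;

# On the decoded string every letter is a single character matching \w.
print "decoded string is all word characters\n" if $chars =~ /^\w+$/;

# On the raw bytes the high-bit octets are not treated as word characters.
print "raw bytes are not\n" unless $bytes =~ /^\w+$/;

# Encode again before writing to a byte-oriented destination.
my $round_trip = encode('UTF-8', $chars);
print "round trip preserved the bytes\n" if $round_trip eq $bytes;
```

The same pattern applies at each boundary: decode on the way in, work with character strings inside the application, and encode on the way out.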
Re: mod_perl, html forms and unicode/utf-8
Thanks Gents,

I've got a certain level of abstraction as per Jonathan's approach, to which I can just add the libapreq method. The note about DBD::MySQL is interesting, I was wondering about that!

cheers
John

Clinton Gormley wrote:
> I've been using libapreq, which has a charset method:
> http://search.cpan.org/~joesuf/libapreq2-2.08/glue/perl/xsbuilder/APR/Request/Param/Param.pod#charset
> [...]