mod_perl, html forms and unicode/utf-8

2007-06-05 Thread John ORourke

Hi folks,

I've been dragged kicking and screaming into the 21st century and am 
making my mod_perl application fully utf-8 aware and transparent.  It's 
all going OK but I want to know if anyone has a better solution to 
receiving form data containing non-ASCII chars.


Output is fine - I can override any Apache settings with 
$r->content_type('text/html; charset=utf-8');


The puzzling bit was getting UTF-8 in and out of forms without mangling
it. I have it working; here's what I do:


HTML:  (the page is specifically set to be utf-8)
 


Perl:
   use Encode;
   use Apache2::Request;

   sub handler {
       my $r = shift;
       my $q = Apache2::Request->new($r);

       # the form post doesn't give a charset, so none is assumed
       my $known_to_be_utf8 = $q->param('test');

       # turn the raw bytes into a Perl character string
       my $utf8_aware_string = decode_utf8( $known_to_be_utf8 );
       ..
       # the above works (we get our data back in one piece)
       # and of course the HTML entities have been turned into UTF-8 chars
   }

I tried some form attributes:
   enctype="multipart/form-data" - this doesn't specify a charset in
      the content-type headers (tried IE6 and FF)
   accept-charset="utf-8" - no change for me (as no charset
      transformation required)


So there's no way for the server to know what charset the parameters are
in; the application has to know what to expect.
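
Since the application has to assume the charset, it can at least check that assumption when decoding. Below is a minimal sketch using only the core Encode module; the helper name decode_form_value is hypothetical and not part of Apache2::Request. It returns undef when the bytes are not valid UTF-8 rather than silently passing mangled data along.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw form bytes that the application
# expects to be UTF-8. Returns undef if the bytes are not valid UTF-8.
sub decode_form_value {
    my ($bytes) = @_;
    return undef unless defined $bytes;

    # decode() may modify its argument when a CHECK value is given,
    # so work on a copy. FB_CROAK makes it die on malformed input.
    my $copy  = $bytes;
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $chars;    # undef if decode() died
}
```

In a handler this would wrap the value returned by $q->param('test') in place of the bare decode_utf8 call.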


Any thoughts?

cheers
John



Re: mod_perl, html forms and unicode/utf-8

2007-06-05 Thread Jonathan Vanasco


On Jun 5, 2007, at 12:56 PM, John ORourke wrote:

  my $q=Apache2::Request->new($r);
  my $known_to_be_utf8 = $q->param('test'); # form post doesn't give charset, none assumed



slightly off topic, my suggestion on implementation would be along  
the lines of this:


package Context;
# method bodies omitted
sub new { my $class = shift; return bless {}, $class }
sub getPost {}
sub getGet {}
sub getGetPost {}
1;

where you have a Context class.  instead of calling $q->param, you
call $ctx->getPost


Context does two things:
a- it encapsulates all of the charset crap, so you only write it once.
b- it gives you uniform access to a few apache::request and cgi
   functions.  this way you can switch between the two as needed.


i switched to that a few weeks ago.  i think randy has a cpan module  
that is similar.


it's a damn lifesaver though.  i took a negligible hit in wrapping all
my get/post in that, but all of my data sanitization routines are
right there in that module.  (strip whitespace, strip charsets i
don't want, etc )


( i actually double wrapped it -- class Context gives me a wrapper,
class MyApp::Context subclasses Context and overrides getPost
with custom sanitization routines per app ).
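
A minimal sketch of that two-layer arrangement might look like the following. Everything beyond the names Context, MyApp::Context and getPost is an assumption made for illustration: the constructor takes a plain hash of raw POST bytes instead of talking to Apache2::Request or CGI, and the subclass overrides a hypothetical sanitize hook rather than getPost itself.

```perl
use strict;
use warnings;

package Context;
use Encode qw(decode);

# $args{post} is a hash ref of raw byte strings (an assumption; a real
# handler would fill it from Apache2::Request or CGI).
sub new {
    my ($class, %args) = @_;
    return bless { post => $args{post} || {} }, $class;
}

# Decode once, in one place, then hand the result to the sanitize hook.
sub getPost {
    my ($self, $name) = @_;
    my $raw = $self->{post}{$name};
    return undef unless defined $raw;

    my $copy  = $raw;    # decode() may modify its argument with a CHECK value
    my $value = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return undef unless defined $value;    # not valid UTF-8
    return $self->sanitize($value);
}

# Default hook: no sanitization.
sub sanitize { my ($self, $value) = @_; return $value }

package MyApp::Context;
use parent -norequire, 'Context';

# Per-app hook: strip leading and trailing whitespace.
sub sanitize {
    my ($self, $value) = @_;
    $value =~ s/^\s+|\s+$//g;
    return $value;
}

package main;
```

Handlers then call MyApp::Context->new(post => \%raw)->getPost('name') and never touch the charset handling directly.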






Re: mod_perl, html forms and unicode/utf-8

2007-06-05 Thread Clinton Gormley
Hi John

I've been using libapreq, which has a charset method:
http://search.cpan.org/~joesuf/libapreq2-2.08/glue/perl/xsbuilder/APR/Request/Param/Param.pod#charset

It is fairly limited; it recognises:

0 APREQ_CHARSET_ASCII (7-bit us-ascii)
1 APREQ_CHARSET_LATIN1 (8-bit iso-8859-1)
2 APREQ_CHARSET_CP1252 (8-bit Windows-1252)
8 APREQ_CHARSET_UTF8 (utf8 encoded Unicode)

but this has been working fine for me on IE 6, 7, Firefox and Opera. I
think (not sure) that these more modern browsers do try to respect the
character set of the web page.

It hasn't been tested to the point that I am certain that it works every
time, but I've had no problems with it over the last year of use.
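
One way to act on those codes is to map each one to an Encode encoding name and decode accordingly. This is only a sketch: the numeric values are the ones listed above, the helper decode_by_charset_code is hypothetical, and $code is assumed to be whatever the charset method reported for the parameter.

```perl
use strict;
use warnings;
use Encode qw(decode);

# libapreq charset codes as listed above, mapped to Encode names.
my %ENCODING_FOR = (
    0 => 'ascii',
    1 => 'iso-8859-1',
    2 => 'cp1252',
    8 => 'UTF-8',
);

# Hypothetical helper: decode $bytes according to a libapreq charset code.
sub decode_by_charset_code {
    my ($code, $bytes) = @_;
    die "unknown charset code: $code\n" unless exists $ENCODING_FOR{$code};
    return decode($ENCODING_FOR{$code}, $bytes);
}
```

Dying on an unknown code is a deliberate choice here; guessing an encoding is how the data gets mangled in the first place.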


Don't forget the other part, which is that, if you put UTF8 into the
database, you may need to reset the UTF8 flag when you get the data back
again.

The new DBD::mysql driver has added this automatically, but I haven't
tried it - I've been using my own wrapper on an older driver which I
know works. Not sure about other drivers, but (again) I "think" there is
reasonable support for UTF8 on the more popular ones.
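
For a driver that hands back raw bytes, a wrapper along these lines would do the job. It is a sketch, not the wrapper mentioned above: decode_row is a hypothetical name, and the row is assumed to be an array ref of column values. DBD::mysql documents a mysql_enable_utf8 connection attribute for the automatic route.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn the UTF-8 bytes in a fetched row into Perl
# character strings. Values that already carry the UTF8 flag are left
# alone so they are not decoded twice.
sub decode_row {
    my ($row) = @_;    # array ref of column values
    return [
        map {
              !defined $_         ? undef
            : utf8::is_utf8($_)   ? $_
            :                       decode('UTF-8', $_)
        } @$row
    ];
}
```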

Once you're happy with the fact that the data coming in and out of your
system is UTF8, it makes life a lot easier.  Things like filtering input
data with \w just work.
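
As a small illustration of the \w point: once the input has been decoded into a character string, accented letters count as word characters. The validator name is_word is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical validator: accept only values made entirely of word characters.
sub is_word {
    my ($chars) = @_;
    return $chars =~ /\A\w+\z/ ? 1 : 0;
}

my $bytes = "na\xc3\xafve";             # UTF-8 bytes as they arrive from the form
my $chars = decode('UTF-8', $bytes);    # five characters, the third is U+00EF

print is_word($chars) ? "accepted\n" : "rejected\n";
```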

good luck

Clint




Re: mod_perl, html forms and unicode/utf-8

2007-06-05 Thread John ORourke

Thanks Gents,

I've got a certain level of abstraction as per Jonathan's approach,
to which I can just add the libapreq method.


The note about DBD::mysql is interesting - I was wondering about that!

cheers
John

