Re: CGI and character encoding

2011-02-25 Thread André Warnier

Thanks to Michael, Michael, Lloyd, Cees,

your answers and insights have made things clearer for me.
I think I'll use a combination of all of that for this new application we're 
writing.

In other words, to program defensively, I propose to do this :

when sending the html page with the form :
- create the page and save it as UTF-8
- have the proper charset indications in it
- include a hidden test field with some known UTF-8 sequence (e.g. ÄÖÜ)
- make sure that the application and the webserver send out the page with the proper 
Content-type and charset (HTTP headers)
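
As a rough illustration of the sending side described above, here is a minimal sketch that builds the response by hand, using only the core Encode module. The field name `utf8-probe`, the helper `form_response` and the `accept-charset` attribute are assumptions made for this example; they are not part of the application discussed in the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Known probe value "ÄÖÜ", written with escapes so that this example does
# not depend on the encoding of the source file itself.
my $PROBE = "\x{C4}\x{D6}\x{DC}";

# Hypothetical helper: returns the complete HTTP response (header + body)
# as a byte string, with the body encoded as UTF-8.
sub form_response {
    my $html = join "\n",
        '<html><head>',
        '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">',
        '</head><body>',
        '<form action="/litfdm/litfdm.pl" method="POST"',
        '      enctype="multipart/form-data" accept-charset="UTF-8">',
        qq{<input name="utf8-probe" type="hidden" value="$PROBE">},
        '</form></body></html>';

    # The charset is announced in the HTTP header as well as in the page.
    return "Content-Type: text/html; charset=UTF-8\r\n\r\n"
         . encode('UTF-8', $html);
}
```

The point of encoding explicitly is that the bytes on the wire then match what the header and the meta tag announce, whatever output layers happen to be set on STDOUT.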


But since we still don't know what the browser (and the user) will actually do 
with this,

upon reception of the POST :
- get the test field and check how it was received
a) check if it has the is_utf8() flag set (probably not)
b) if not (a), check if at least it has the correct UTF-8 bytes in it 
(6 bytes, not 3)
c) if neither (a) nor (b), reject with an error (we don't know what it is then)
d) if not (a), but (b), then set a flag 'must_decode'

- get the other parameters, and
- if the 'must_decode' flag is not set, leave them 'as is'
- if the flag is set, Encode::decode('utf8',..) all received
parameters, except for file uploads (*)

That's of course in the hope that, some day, browsers will send multipart data with the 
proper charset indication, and that CGI.pm will take it into account and do the right thing.




(*) although a question then is how a Polish browser would send the filename attribute, 
assuming it is originally something like Qualitätsübersicht.pdf
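
Steps (a) to (d) above could be sketched roughly as follows. The function name `classify_probe` and its return values are invented for this illustration, and the test on the utf8 flag simply follows the plan as stated; it is a heuristic, not a guarantee about how a given string was produced.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The known probe value "ÄÖÜ" that was placed in the hidden field.
my $PROBE = "\x{C4}\x{D6}\x{DC}";

# Hypothetical helper: decide what to do with the posted parameters,
# based on how the probe field arrived.
sub classify_probe {
    my ($received) = @_;
    return 'reject' unless defined $received;

    # (a) already a decoded character string: leave the parameters as is
    return 'as_is'
        if utf8::is_utf8($received) && $received eq $PROBE;

    # (b)/(d) no flag, but exactly the 6 UTF-8 bytes of the probe
    return 'must_decode'
        if !utf8::is_utf8($received)
        && $received eq encode('UTF-8', $PROBE);

    # (c) anything else (e.g. 3 Latin-1 bytes): we do not know what it is
    return 'reject';
}
```

Note that a probe arriving as three ISO-8859-1 bytes ends up in the 'reject' branch, which matches case (c) of the plan.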




CGI and character encoding

2011-02-24 Thread André Warnier

Hi.

I wonder if someone here can give me a clue as to where to look...

I am using
Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_jk/1.2.26 PHP/5.2.6-1+lenny9 with Suhosin-Patch 
mod_ssl/2.2.9 OpenSSL/0.9.8g mod_apreq2-20051231/2.6.0 mod_perl/2.0.4 Perl/v5.10.0


perl -MCGI -e 'print $CGI::VERSION'
3.52

A perl cgi-bin script running under mod_perl, receives posted form parameters from a form 
defined as such :


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">

 <body>
<form action="/litfdm/litfdm.pl" name="form"
enctype="multipart/form-data" charset="UTF-8" method="POST">
...
<input name="de-utf8" type="hidden" value="ÄäÖöÜü">
...

(Note: the html page itself has been saved as UTF-8 by an UTF-8 aware editor)


When I retrieve the above hidden field using

my $chars = $cgi->param('de-utf8');

the variable $chars does contain the proper UTF-8 encoded *bytes* for the above string (in 
other words, 2 bytes per character e.g.), but it arrives into the script /without/ the 
perl utf8 flag set.


If I then use this value to print to a filehandle opened as such :

open(FH,'>:utf8',"myfile");
print FH $chars,"\n";

It comes out of course as .. well, I cannot type this on my keyboard, but anyone aware of 
double-encoding issues can imagine the A-tilde Copyright A-tilde squiggle..  result.
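
For readers who have not met the effect, the following small sketch reproduces it with the core Encode module alone, without CGI.pm or a filehandle. The helper `hex_bytes` is made up for this example; calling encode() directly on the raw bytes stands in for what the ':utf8' output layer does to a string that lacks the utf8 flag.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# The two UTF-8 bytes of "Ä" (U+00C4), as received without the utf8 flag.
my $bytes = "\xC3\x84";

# Each byte is taken for a character of its own and encoded again.
my $double = encode('UTF-8', $bytes);

# Decoding first gives one character, which encodes back to the same 2 bytes.
my $chars  = decode('UTF-8', $bytes);
my $single = encode('UTF-8', $chars);

print 'without decoding: ', hex_bytes($double), "\n";   # C3 83 C2 84
print 'decoded first:    ', hex_bytes($single), "\n";   # C3 84
```

The first line of output shows four bytes where two were expected, which is the double encoding described above.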


I can of course convert it, by using

$chars = Encode::decode('utf8',$cgi->param('de-utf8'));

but it is a p.i.t.a. and I would like to know if there is a way to retrieve the posted 
value directly as UTF-8, and if yes what this depends on.

(I cannot find a setting for instance in the CGI.pm module documentation.)


Thanks.
André

P.S.
Unfortunately, when the browser (Firefox 3.5.3) is posting this data to the server, it is 
posting it as something like


...
Content-Type    multipart/form-data; 
boundary=---326972172326727
...

-326972172326727
Content-Disposition: form-data; name="de-utf8"

ÄäÖöÜü
-326972172326727

which means that there is no charset header to the parts either.


Re: CGI and character encoding

2011-02-24 Thread Michael Peters

On 02/24/2011 04:31 PM, André Warnier wrote:

> I wonder if someone here can give me a clue as to where to look...

The CGI.pm documentation talks about the -utf8 import flag which is 
probably what you're looking for. But it does caution not to use it for 
anything that needs to do file uploads.


--
Michael Peters
Plus Three, LP


Re: CGI and character encoding

2011-02-24 Thread Cees Hek

Hi André,

There is a perlmonks post from a few years ago that explains one way
of automating this with CGI.pm.  I've used this for several years now
without problems.

http://www.perlmonks.org/?node_id=651574

Just remember that decoding params is only one part of dealing with
utf-8.  You need to worry about any data coming into or going out of
your app (reading files, retrieving from the DB, sending HTML out to the
browser, etc...).  The following wiki book has some great information
on how to deal with utf-8 in your perl applications (and it also
includes the CGI.pm hack from Rhesa that is described in the perlmonks
post linked above).

http://en.wikibooks.org/wiki/Perl_Programming/Unicode_UTF-8

Cheers,

Cees Hek



RE: CGI and character encoding

2011-02-24 Thread Lloyd Richardson

FWIW, with CGI.pm I always iterate through the params and Encode::decode each 
one with the appropriate encoding, with an exception for anything binary (file 
uploads, etc.).
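
A minimal sketch of that approach, working on a plain hash so that it runs without CGI.pm: the function name `decode_params` and the way upload fields are marked are assumptions made for this illustration. In a real script the hash would be filled from the values returned by $cgi->param.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper.
#   $params    : hash ref, field name => array ref of raw byte values
#   $is_upload : hash ref, field name => true for file-upload fields
#   $encoding  : encoding the browser is believed to have used
sub decode_params {
    my ($params, $is_upload, $encoding) = @_;
    my %decoded;
    for my $name (keys %$params) {
        my @values = @{ $params->{$name} };
        if ($is_upload->{$name}) {
            # Binary data: copy the bytes through untouched.
            $decoded{$name} = [@values];
            next;
        }
        # FB_CROAK makes decode() die on malformed input rather than
        # silently substituting replacement characters.
        $decoded{$name} = [
            map { decode($encoding, $_, Encode::FB_CROAK() | Encode::LEAVE_SRC()) }
                @values
        ];
    }
    return \%decoded;
}
```

Dying on malformed input is a design choice for this sketch; an application might prefer to catch the error and reject the request with a message instead.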



Re: CGI and character encoding

2011-02-24 Thread André Warnier

Michael Peters wrote:
> On 02/24/2011 04:31 PM, André Warnier wrote:
>
>> I wonder if someone here can give me a clue as to where to look...
>
> The CGI.pm documentation talks about the -utf8 import flag which is 
> probably what you're looking for. But it does caution not to use it for 
> anything that needs to do file uploads.


Thanks. My workstation version of the CGI documentation is apparently outdated, and did 
not mention that pragma.  The CPAN version does.
But yes, I will need file uploads too, and since there is no telling how exactly the -utf8 
flag interferes with them, I think I'll stick with the p.i.t.a. method for now.


I wonder why browsers do not put a charset parameter in the multipart/form-data 
parts..
It would seem like a logical and MIME-conformant thing to do.





Re: CGI and character encoding

2011-02-24 Thread Michael Schout

On 02/24/2011 03:31 PM, André Warnier wrote:
> Hi.
>
> I wonder if someone here can give me a clue as to where to look...

If you have a fairly recent CGI.pm, it will decode utf-8 properly for
you (even avoiding double-decoding), but there are some caveats.  In
addition to what others have already said, if you are running under
mod_perl (which obviously you are), CGI.pm adds a cleanup handler (via
register_cleanup) which resets CGI.pm's global variables.  One of the
variables that gets reset is the PARAM_UTF8 variable (which the -utf8
import controls).  Because of this, once the cleanup handler gets
called, UTF-8 decoding gets turned off.

You have to work around this by manually making sure $CGI::PARAM_UTF8 =
1 before calling CGI->new.
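
In code, that workaround might look something like the sketch below. It assumes CGI.pm is installed and recent enough to know the -utf8 import; the wrapper name `new_utf8_cgi` is invented for this example, and the file-upload caveat mentioned elsewhere in the thread still applies.

```perl
use strict;
use warnings;
use CGI qw(-utf8);    # ask CGI.pm to decode parameters as UTF-8

# Hypothetical wrapper, to be called once per request.
sub new_utf8_cgi {
    # Under mod_perl the cleanup handler resets CGI.pm's globals after
    # each request, so the flag is re-asserted before every CGI->new.
    $CGI::PARAM_UTF8 = 1;
    return CGI->new;
}
```

Setting the variable inside the wrapper, rather than once at load time, is what keeps the second and later requests in the same interpreter decoding their parameters.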

Regards,
Michael Schout