Michael Schout wrote:
On 9/2/14, 4:19 PM, Randal L. Schwartz wrote:
## ensure utf8 CGI params:
$CGI::PARAM_UTF8 = 1;
Sorry to chime in late on this, but part of the problem with CGI.pm and
UTF-8 is that PARAM_UTF8 gets clobbered by a cleanup handler that CGI.pm
itself registers if it's running under mod_perl.
This caused major headaches for me at one time until I figured this out.
You have to make sure to set $CGI::PARAM_UTF8 early, and FOR EVERY
REQUEST, because if you just set it globally (e.g. in a startup Perl
script), then it only works for the first request.
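A minimal sketch of the "set it on every request" advice above. The helper name fetch_param and the explicit query-string argument are illustrative assumptions, not part of the original posts; in a real mod_perl handler you would normally call CGI->new without a query string.

```perl
use strict;
use warnings;
use CGI ();

# Hypothetical per-request helper (name and arguments are assumptions).
# The flag is set again on every call, because a value set once at
# startup is reset after the first request under mod_perl.
sub fetch_param {
    my ($query_string, $name) = @_;

    # Set before any call to param(); CGI.pm consults it at that point.
    $CGI::PARAM_UTF8 = 1;

    # Passing a query string lets the sketch run outside a web server.
    my $q = CGI->new($query_string);

    # Scalar context avoids the list-context warning in newer CGI.pm.
    return scalar $q->param($name);
}
```

Called as fetch_param('name=caf%C3%A9', 'name'), this should return a decoded character string rather than raw bytes, provided the installed CGI.pm honours $CGI::PARAM_UTF8.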
Hi.
Just an addendum to the discussion:
There are really two distinct approaches to this issue, and they work at
different levels:
1) is to "fix" CGI.pm so that it delivers the parameters in the way you
expect.
As shown by the previous valuable and technical contributions, this generally works, but
it requires a certain level of expertise, and it is not necessarily backward compatible
with all versions of mod_perl and CGI.pm.
2) is to take whatever CGI.pm does deliver to the calling script or module, and use a
couple of tricks and some additional code in that script or module to ensure that,
whatever CGI.pm delivers under whatever mod_perl version, the receiving script or module
always knows in the end what it is dealing with.
That is the method which I presented early in the discussion.
As stated in that contribution, it is not necessarily the most elegant or efficient way to
deal with the issue, but it has the advantage of always working, no matter which versions
of CGI.pm and/or mod_perl are in use.
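To illustrate the general idea of method 2 (this is not necessarily the code presented earlier in the thread), here is a minimal sketch of a helper that accepts whatever CGI.pm hands over and always returns a decoded character string. The name normalize_param and the ISO-8859-1 fallback are assumptions made for the example; pick the fallback that matches what your clients really send.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper (the name is an assumption, not from CGI.pm).
# Returns a Perl character string whatever form the value arrived in.
sub normalize_param {
    my ($value) = @_;
    return $value unless defined $value;

    # Already marked as characters: assume it was decoded upstream.
    return $value if utf8::is_utf8($value);

    # Try strict UTF-8 first. Decode a copy, because decode() may
    # modify its source argument when a CHECK value is given.
    my $copy    = $value;
    my $decoded = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $decoded if defined $decoded;

    # Not valid UTF-8: fall back to ISO-8859-1 (an assumption).
    return Encode::decode('ISO-8859-1', $value);
}
```

The receiving code would then pass every value through the helper, e.g. my $name = normalize_param(scalar $q->param('name')); so that the rest of the script only ever sees character strings, whether or not $CGI::PARAM_UTF8 took effect.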
The real crux of the matter is this, in my view: as things stand today in terms of
protocol and RFCs, there is no real way for CGI.pm (or any comparable framework) to be
*sure* of the encoding of the data sent by a browser or another HTTP client agent. Even
the RFCs do not really provide a way by which this can be enforced. (*)
So if you are sure of what the client is sending, and the matter consists of *forcing*
CGI.pm to always communicate POST (or GET) data as UTF-8 encoded and utf8-marked (or the
opposite) to the calling script/module, then method 1 will work, and it is more elegant
and probably more efficient than method 2.
But if the matter consists of ensuring that the receiving code in the script/module which
handles the data submitted by the HTTP client is resilient and "does the right thing"
whatever the submitted data really was, then in my opinion method 2 is better.
(But that's only my opinion of the moment, and I stand ready to be corrected).
(*) And if you believe this not to be true, please send me some references about it,
because I am really interested. It might save me some code in all my web-facing applications.