I think these things get confused, and confusing, very quickly unless
one goes through them one step at a time.
Let me try a first iteration:
1) URIs, as sent to the HTTP server, should contain only US-ASCII
characters (and no spaces). Any other characters should be
percent-encoded, as specified by the URI RFC (RFC 3986).
2) If Firefox is smart enough to encode a URI properly on its own when
it notices non-US-ASCII characters, that is a nice feature of Firefox,
but it should not confuse the main issue.
In other words, if you send a non-ASCII URI to a server yourself (via
curl or lwp-request, for example), then you must arrange to URI-encode
the request yourself.
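To make points 1) and 2) concrete, here is a minimal sketch of what
"URI-encoding the request yourself" could look like in Perl, before the
URL is handed to curl or lwp-request. It assumes the value is meant to
be sent as UTF-8, which is a choice on the client side and not something
the URI rules impose. encode_query_value is a hypothetical helper name,
and only core modules are used.

```perl
use strict;
use warnings;
use utf8;                   # this source file contains a UTF-8 string literal
use Encode qw(encode);

# Hypothetical helper: percent-encode one query value, assuming UTF-8.
sub encode_query_value {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);    # characters -> UTF-8 octets
    # Escape every octet that is not an RFC 3986 "unreserved" character.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $url = 'http://localhost/test?x=' . encode_query_value('חוזרת');
print "$url\n";
```

For the Hebrew test value used later in this thread, this produces
x=%D7%97%D7%95%D7%96%D7%A8%D7%AA, i.e. a URL that contains only US-ASCII
characters.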
3) According to a previous response, at the receiving side, when Apache
gets a properly-encoded request URI containing non-ASCII characters, it
leaves it encoded and passes it "as is" (or "as bytes") to the
processing layer, which in this case is mod_perl.
4) mod_perl parses the URI and makes it accessible in several ways to
the modules running under it (in this case a request handler or a script).
Question: does mod_perl decode the URI string before passing it in
bits and pieces to the handler/script, or not?
(From another response, it would seem that it doesn't.)
5) The handler/script obtains the URI parts from mod_perl, possibly
through the RequestRec or Request object.
If such URI parts contained non-ASCII characters, do these modules
perform any translation, or does the handler/script still receive them
as URI-encoded?
(From another response, it would seem that the modules perform no
translation, and that the handler/script does receive them still
URI-encoded.)
6) Now the handler/script has the value of (for instance) the query
parameter "id" (assume it contains non-ASCII characters), and it
wants to output it back to the browser.
To do that, it must arrange to send the browser an HTTP header that
tells the browser in which character set this response is encoded,
since by default the HTTP protocol says it is ISO-8859-1.
And it seems that in order to do that, it should use, as a minimum:
    my $param = $apr->param('id');
    $r->content_type('text/plain; charset="UTF-8"');
    $r->print($param);
There are a couple of aspects not mentioned above, such as:
- How does the handler/script "know" which decoding it should apply to
the URI elements? Is it certain that it is UTF-8?
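As far as I can tell, the honest answer to that last question is that
the handler cannot be certain, because the request itself does not say
which character set was percent-encoded. One possible approach, sketched
below as a heuristic and not as what mod_perl or libapreq2 actually do,
is to undo the percent-encoding, try a strict UTF-8 decode, and fall
back to ISO-8859-1 when the octets are not valid UTF-8.
decode_query_value is a hypothetical helper name; only the core Encode
module is used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn one raw query value into Perl characters.
sub decode_query_value {
    my ($raw) = @_;
    (my $bytes = $raw) =~ tr/+/ /;                   # '+' stands for a space in form data
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # undo the percent-encoding
    # Strict UTF-8 first; decode() croaks on malformed input with FB_CROAK.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    # Assumption: anything that is not valid UTF-8 is treated as ISO-8859-1.
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# The percent-encoded Hebrew test value decodes to 5 characters.
printf "%d characters\n", length decode_query_value('%D7%97%D7%95%D7%96%D7%A8%D7%AA');
```

The fallback is only a guess: a short ISO-8859-1 string can happen to be
valid UTF-8, so a handler that controls its own forms and links is
better off fixing the encoding on the sending side.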
Another go, anyone ?
André
Torsten Foertsch wrote:
On Wed 19 Mar 2008, Eli Shemer wrote:
For some reason the following test doesn’t print anything out to the screen
Do I need to change something in the apache configuration, or mod_perl’s ?
/articles_read.pl?id=חוזרת
This is probably a bug in libapreq2. I have tried this handler:
sub {
    my $r = $_[0];
    $r->content_type('text/html; charset=UTF-8');
    my $x = Apache2::Request->new($r);
    # Print the raw query string next to the value parsed by libapreq2.
    $r->print("<html><body>\nargs=" . $r->args . "\nparam(x)=" .
              $x->param('x') . "\n</body></html>\n");
    return Apache2::Const::OK;
}
http://localhost/test?x=חוזרת entered in FF changes on the fly into
http://localhost/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA and it works.
But on the command line with curl it doesn't:
$ curl 'http://localhost/test?x=חוזרת' -v
* About to connect() to localhost port 80 (#0)
* Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /test?x=חוזרת HTTP/1.1
> User-Agent: curl/7.16.4 (i686-suse-linux-gnu) libcurl/7.16.4 OpenSSL/0.9.8e
zlib/1.2.3 libidn/1.0
> Host: localhost
> Accept: */*
< HTTP/1.1 200 OK
< Date: Wed, 19 Mar 2008 12:45:29 GMT
< Server: Apache/2.2.6 (Unix) mod_ssl/2.2.6 OpenSSL/0.9.8e DAV/2 SVN/1.4.5
mod_apreq2-20051231/2.6.0 mod_perl/2.0.4-dev Perl/v5.8.8
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
<html><body>
args=x=חוזרת
param(x)=
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0
Torsten