I think these things can get confused, and confusing, very quickly unless one goes through them one step at a time.
Let me try a first iteration:

1) URIs, as sent to the HTTP server, should contain only US-ASCII characters (and no spaces). Any other characters should be encoded using the RFC-dictated URI-encoding (percent-encoding) scheme.

2) Whether Firefox is smart enough to encode a URI automatically when it notices non-US-ASCII characters is a nice aspect of Firefox if it does, but it should not confuse the main issue. In other words, if you send a URI containing non-ASCII characters to a server by other means (via curl or lwp-request, for example), then you have to arrange to URI-encode the request yourself.

3) According to a previous response, at the receiving side, when Apache gets a properly encoded request URI containing non-ASCII characters, it leaves it encoded and passes it "as is" (or "as bytes") to the processing layer, which in this case is mod_perl.

4) mod_perl parses the URI and makes it accessible in several ways to the modules running under it (in this case a request handler or a script). Question: does mod_perl decode the URI string before passing it in bits and pieces to the handler/script, or not?
(From another response, it would seem that it doesn't.)

5) The handler/script obtains the URI parts from mod_perl, possibly through the RequestRec or Request object. If such URI parts contained non-ASCII characters, do these modules perform any translation, or does the handler/script still receive them URI-encoded?
(From another response, it would seem that the modules don't translate, and that the handler/script does receive them still encoded.)

6) Now the handler/script has the value of (for instance) the query parameter "id" (assume it contains non-ASCII characters), and it wants to output it back to the browser. To do that, it must arrange to send the browser an HTTP header that says in which character set the response is encoded, since by default the HTTP protocol says it is ISO-8859-1.
And it seems that, in order to do that, it should use, as a minimum:

my $param = $apr->param('id');
$r->content_type('text/plain; charset="UTF-8"');
$r->print($param);
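
To make steps 3) to 6) concrete outside of Apache, here is a minimal sketch in plain Perl of what the handler would have to do if nothing upstream decodes the value for it. The helper name query_param_utf8 is made up for illustration, and the sketch assumes that the client percent-encoded UTF-8 bytes, which is exactly the point still in question. In a real handler the raw string would come from $r->args.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper (not part of mod_perl or libapreq2): pull one
# parameter out of a raw, still-encoded query string and turn it into
# Perl characters. Assumption: the client percent-encoded UTF-8 bytes.
sub query_param_utf8 {
    my ($raw_query, $name) = @_;
    for my $pair (split /[&;]/, $raw_query) {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && $key eq $name;
        $value = '' unless defined $value;
        $value =~ tr/+/ /;                              # form encoding: '+' means space
        $value =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # percent-decoding yields bytes
        return decode('UTF-8', $value);                 # bytes -> characters
    }
    return;
}

my $id = query_param_utf8('id=%D7%97%D7%95%D7%96%D7%A8%D7%AA', 'id');
printf "%d characters\n", length $id;   # prints "5 characters"
print encode('UTF-8', $id), "\n";       # characters -> UTF-8 bytes for the response body
```

The final encode step is the counterpart of the charset announced in the Content-Type header: whatever is declared there has to match the bytes actually printed.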

There are a couple of aspects not mentioned above, such as
- how does the handler/script "know" which decoding it should apply to the URI elements ? Is it certain that it is UTF-8 ?
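
As far as I know, the URI itself does not say which character set its encoded bytes are in, so the handler can only guess or rely on a convention. One common heuristic, sketched below under that assumption, is to try a strict UTF-8 decode first and to fall back to ISO-8859-1 if the bytes are not valid UTF-8. The helper name decode_guess is made up for illustration, and the heuristic can be wrong: some ISO-8859-1 byte sequences are also valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: guess the charset of already percent-decoded bytes.
# Returns the decoded text and the name of the charset that was assumed.
sub decode_guess {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK argument may modify its input
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return ($text, 'UTF-8') if defined $text;
    # Every byte sequence is valid ISO-8859-1, so this fallback cannot fail.
    return (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}

my (undef, $guess1) = decode_guess("\xD7\x97\xD7\x95");  # valid UTF-8 bytes
my (undef, $guess2) = decode_guess("\xE9t\xE9");         # "ete" with acute accents, in ISO-8859-1
print "$guess1 $guess2\n";                               # prints "UTF-8 ISO-8859-1"
```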


Another go, anyone ?

André





Torsten Foertsch wrote:
On Wed 19 Mar 2008, Eli Shemer wrote:
For some reason the following test doesn’t print anything out to the screen

Do I need to change something in the apache configuration, or mod_perl’s ?

/articles_read.pl?id=חוזרת

This is probably a bug in libapreq2. I have tried this handler:

sub {
  my $r = $_[0];
  $r->content_type('text/html; charset=UTF-8');
  my $x = Apache2::Request->new($r);
  $r->print("<html><body>\nargs=" . $r->args .
            "\nparam(x)=" . $x->param('x') .
            "\n</body></html>\n");
  return Apache2::Const::OK;
}

http://localhost/test?x=חוזרת entered in FF changes on the fly into
http://localhost/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA and it works.

But on the command line with curl it doesn't:

$ curl 'http://localhost/test?x=חוזרת' -v
* About to connect() to localhost port 80 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80 (#0)
> GET /test?x=חוזרת HTTP/1.1
> User-Agent: curl/7.16.4 (i686-suse-linux-gnu) libcurl/7.16.4 OpenSSL/0.9.8e zlib/1.2.3 libidn/1.0
> Host: localhost
> Accept: */*

< HTTP/1.1 200 OK
< Date: Wed, 19 Mar 2008 12:45:29 GMT
< Server: Apache/2.2.6 (Unix) mod_ssl/2.2.6 OpenSSL/0.9.8e DAV/2 SVN/1.4.5 mod_apreq2-20051231/2.6.0 mod_perl/2.0.4-dev Perl/v5.8.8
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
<html><body>
args=x=חוזרת
param(x)=
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0
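
The curl request above carries the raw bytes of the parameter, whereas Firefox sent the percent-encoded form. A minimal sketch of producing that encoded form before handing the URL to curl, using only core Perl, could look like this. The helper name percent_encode_utf8 is made up for illustration, and encoding as UTF-8 is an assumption that happens to match what Firefox produced here.

```perl
use strict;
use warnings;
use utf8;                  # this source file contains literal Hebrew characters
use Encode qw(encode);

# Hypothetical helper: percent-encode a character string as UTF-8 bytes,
# leaving only the "unreserved" URI characters untouched.
sub percent_encode_utf8 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);                             # characters -> bytes
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;  # byte -> %XX
    return $bytes;
}

my $query = 'x=' . percent_encode_utf8('חוזרת');
print "http://localhost/test?$query\n";
# prints http://localhost/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA
```

The printed URL is the same one Firefox generated on the fly, so passing it to curl should exercise the same code path in the handler.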

Torsten
