Re: utf8 urls

André Warnier Wed, 19 Mar 2008 06:33:37 -0700

I think these things can get confused and confusing very quickly, unless one goes through them one step at a time.

Let me try a first iteration:

1) URIs, as sent to the HTTP server, should contain only US-ASCII characters (and no spaces). If there are other characters, they should be percent-encoded as the URI RFC (RFC 3986) dictates.

2) Whether Firefox is smart enough to encode a URI properly on its own, when it notices that it contains non-US-ASCII characters, is a nice aspect of Firefox if it does, but it should not confuse the main issue. In other words, if you send a non-ASCII URI to a server yourself (via curl or lwp-request, for example), then you have to URI-encode the request yourself.

3) According to a previous response, on the receiving side, when Apache gets a properly encoded request URI containing non-ASCII characters, it leaves it encoded and passes it "as is" (or "as bytes") to the processing layer, which in this case is mod_perl.

4) mod_perl parses the URI and makes it accessible in several ways to the modules running under it (in this case a request handler or a script).
Question: does mod_perl decode the URI string before passing it in bits and pieces to the handler/script, or not?

(From another response, it would seem that it doesn't)

5) The handler/script obtains the URI parts from mod_perl, possibly through the RequestRec or Request object.
If such URI parts contained non-ASCII characters, do these modules perform any translation, or does the handler/script still receive them URI-encoded?

(From another response, it would seem that they don't, and it does)

6) Now the handler/script has the value of (for instance) the query parameter "id" (assume it contains non-ASCII characters), and it wants to output it back to the browser.
To do that, it must arrange to send the browser an HTTP header that tells the browser in which character set the response is encoded, since by default the HTTP protocol says it is iso-8859-1.

And it seems that in order to do that, it should use, as a minimum:

# $apr is the Apache2::Request object, $r the request record
my $param = $apr->param('id');
$r->content_type('text/plain; charset="UTF-8"');
$r->print($param);
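To make steps 1 to 5 concrete, here is a minimal round-trip sketch in plain Perl, using only the core Encode module so that it runs outside Apache. The helper names uri_encode_utf8 and uri_decode_utf8 are made up for illustration and are not part of mod_perl or libapreq2. The sketch also simply assumes that the bytes in the URI are UTF-8.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: turn characters into UTF-8 bytes, then percent-encode
# every byte that is not in the URI "unreserved" set.
sub uri_encode_utf8 {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Hypothetical helper: undo the percent-encoding to get raw bytes back,
# then decode those bytes as UTF-8 (this is the assumption).
sub uri_decode_utf8 {
    my ($encoded) = @_;
    (my $bytes = $encoded) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

# The Hebrew word used in the test URL in this thread, written as code points.
my $word    = "\x{5D7}\x{5D5}\x{5D6}\x{5E8}\x{5EA}";
my $encoded = uri_encode_utf8($word);
my $decoded = uri_decode_utf8($encoded);

print "$encoded\n";    # %D7%97%D7%95%D7%96%D7%A8%D7%AA
print $decoded eq $word ? "round trip ok\n" : "round trip failed\n";
```

The encoded form is the same string that Firefox produces on the fly for the test URL, so a client that does not encode by itself would have to send that form.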

There are a couple of aspects not mentioned above, such as

- how does the handler/script "know" which decoding it should apply to the URI elements? Is it certain that it is UTF-8?
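As far as I can see, nothing in the request itself says which character set the URI bytes are in, so the handler can only guess. One common heuristic, and it is only a heuristic, is to try a strict UTF-8 decode first and fall back to iso-8859-1 if that fails. A minimal sketch with the core Encode module; the helper name decode_param_bytes is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Hypothetical helper: returns the decoded text and the charset that was guessed.
sub decode_param_bytes {
    my ($bytes) = @_;
    my $copy  = $bytes;    # decode() with a CHECK value may modify its argument
    my $chars = eval { decode('UTF-8', $copy, FB_CROAK) };
    return ($chars, 'UTF-8') if defined $chars;
    # Not valid UTF-8: treat every byte as one iso-8859-1 character.
    return (decode('ISO-8859-1', $bytes), 'ISO-8859-1');
}

my ($text1, $charset1) = decode_param_bytes("\xD7\x97\xD7\x95");   # valid UTF-8 bytes
my ($text2, $charset2) = decode_param_bytes("Andr\xE9");           # a lone 0xE9 is not valid UTF-8

print "$charset1\n";   # UTF-8
print "$charset2\n";   # ISO-8859-1
```

This can guess wrong: a byte sequence that happens to be valid UTF-8 may still have been meant as iso-8859-1, so it does not really answer the question above.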



Another go, anyone ?

André





Torsten Foertsch wrote:

On Wed 19 Mar 2008, Eli Shemer wrote:

For some reason the following test doesn't print anything out to the screen.

Do I need to change something in the Apache configuration, or in mod_perl's?

/articles_read.pl?id=חוזרת


This is probably a bug in libapreq2. I have tried this handler:

sub {
  my $r=$_[0];
  $r->content_type('text/html; charset=UTF-8');
  my $x=Apache2::Request->new($r);

  $r->print("<html><body>\nargs=".$r->args."\nparam(x)=".$x->param('x')."\n</body></html>\n");

  return Apache2::Const::OK;
}

http://localhost/test?x=חוזרת entered in FF changes on the fly into
http://localhost/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA and it works.

But on the command line with curl it doesn't:

$ curl 'http://localhost/test?x=חוזרת' -v
* About to connect() to localhost port 80 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 80 (#0)

> GET /test?x=חוזרת HTTP/1.1
> User-Agent: curl/7.16.4 (i686-suse-linux-gnu) libcurl/7.16.4 OpenSSL/0.9.8e zlib/1.2.3 libidn/1.0
> Host: localhost
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 19 Mar 2008 12:45:29 GMT
< Server: Apache/2.2.6 (Unix) mod_ssl/2.2.6 OpenSSL/0.9.8e DAV/2 SVN/1.4.5 mod_apreq2-20051231/2.6.0 mod_perl/2.0.4-dev Perl/v5.8.8
< Transfer-Encoding: chunked
< Content-Type: text/html; charset=UTF-8
<
<html><body>
args=x=חוזרת
param(x)=
</body></html>
* Connection #0 to host localhost left intact
* Closing connection #0

Torsten
