Re: utf8 urls
I think these things get confusing very quickly unless one goes through them one step at a time. Let me try a first iteration:

1) URIs, as sent to the HTTP server, should contain only US-ASCII characters (and no spaces). Any other characters should be encoded using the URI-encoding scheme dictated by the relevant RFC.

2) Whether Firefox is smart enough to encode a URI automatically when it notices non-US-ASCII characters is a nice aspect of Firefox if it does, but it should not confuse the main issue. In other words, if you send a URI containing non-ASCII characters to a server by other means (via curl or lwp-request, e.g.), then you have to arrange to URI-encode the request yourself.

3) According to a previous response, when Apache receives a properly encoded request URI containing non-ASCII characters, it leaves it encoded and passes it "as is" (or "as bytes") to the processing layer, which in this case is mod_perl.

4) mod_perl parses the URI and makes it accessible in several ways to the modules running under it (in this case a request handler or a script). Question: does mod_perl decode the URI string before passing it in bits and pieces to the handler/script, or not? (From another response, it would seem that it doesn't.)

5) The handler/script obtains the URI parts from mod_perl, possibly through the RequestRec or Request object. If such URI parts contained non-ASCII characters, do these modules perform any translation, or does the handler/script still receive them as URI-encoded? (From another response, it would seem that they don't, and it does.)

6) Now the handler/script has the value of (for instance) the query parameter "id" (assume it contains non-ASCII characters), and it wants to send it back to the browser. To do that, it must send the browser an HTTP header that says in which character set the response is encoded, since by default the HTTP protocol says it is iso-8859-1. It seems that, as a minimum, it should do:

    $param = $apr->param('id');
    $r->content_type('text/plain; charset="UTF-8"');
    $r->print($param);

There are a couple of aspects not mentioned above, such as: how does the handler/script "know" which decoding it should apply to the URI elements? Is it certain that it is UTF-8?

Another go, anyone?

André
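On the open question of how a handler can "know" which decoding to apply: it cannot know for certain, but it can test. Below is a minimal sketch, not anything mod_perl or libapreq2 does for you. It assumes the handler has the raw, still percent-encoded parameter value in hand, tries a strict UTF-8 decode, and falls back to iso-8859-1 if the octets are not valid UTF-8. The helper name decode_query_value and the fallback policy are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn a raw (percent-encoded) query value into
# Perl characters, and report which charset was assumed.
sub decode_query_value {
    my ($raw) = @_;

    (my $octets = $raw) =~ tr/+/ /;                   # '+' means space in form-style queries
    $octets =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # percent-decoding yields octets

    # decode() with a CHECK argument may modify its input, so work on a copy.
    my $copy  = $octets;
    my $chars = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    return ($chars, 'UTF-8') if defined $chars;

    # Assumption: anything that is not valid UTF-8 is treated as
    # iso-8859-1, the HTTP default mentioned above.
    return (Encode::decode('ISO-8859-1', $octets), 'ISO-8859-1');
}

my ($chars, $charset) = decode_query_value('%D7%97%D7%95%D7%96%D7%A8%D7%AA');
printf "%s: %d characters\n", $charset, length $chars;
```

The strict decode is only a heuristic: a short iso-8859-1 string can happen to be valid UTF-8, so a site that controls its own links is better off deciding on UTF-8 once and rejecting anything else.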
Re: utf8 urls
Geoffrey Young wrote:
> John ORourke wrote:
>> Eli Shemer wrote:
>>> For some reason the following test doesn’t print anything out to the screen
>>
>> I'm not sure why you get nothing, but I can tell you strings read from Apache objects come through as octets and need to be decoded before use. We're using UTF-8 chars in URLs but I've never used one in a GET request parameter.
>
> I can't say why it doesn't work, but I'm surprised it would in either case - the only characters explicitly allowed in a uri are us-ascii. from rfc2396:

My bad memory there - you are quite correct. The way we do it is the accepted way: URL-encode the UTF-8 encoded text, and that will work with URLs and parameters. E.g.:

    http://www../categories/name/ty%C3%B6kalut-lamput

is the correct form of:

    http://www../categories/name/työkalut-lamput

Encode before printing:

    use Encode qw(encode_utf8 decode_utf8);

    $octets = encode_utf8($my_utf8_string);                     # make octets
    $octets =~ s/([^\041-\176])/sprintf("%%%02X",ord($1))/ge;   # URL-encode everything outside printable ASCII
    $r->print($octets);

(the above is simplified - you'll also need to encode question marks etc.)

Decode after reading:

    $url = decode_utf8( $r->uri() );

or

    $param = decode_utf8( $r->param('info') );

cheers
John
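For the "you'll also need to encode question marks etc" part, here is a minimal sketch of a fuller encoder and its inverse, using only the core Encode module. It escapes every octet except the "unreserved" set of RFC 3986 (letters, digits, "-", ".", "_", "~"), which is a slightly narrower set than the RFC 2396 one quoted in this thread; that choice, and the helper names escape_component and unescape_component, are assumptions for illustration, not part of mod_perl or libapreq2.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: characters in, percent-encoded US-ASCII out.
sub escape_component {
    my ($chars) = @_;
    my $octets = encode_utf8($chars);    # characters -> UTF-8 octets
    # Escape every octet that is not an RFC 3986 "unreserved" character.
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

# Hypothetical inverse: percent-encoded US-ASCII in, characters out.
sub unescape_component {
    my ($escaped) = @_;
    (my $octets = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode_utf8($octets);         # UTF-8 octets -> characters
}

my $name = "ty\x{F6}kalut-lamput";       # "työkalut-lamput"
print escape_component($name), "\n";     # ty%C3%B6kalut-lamput
print escape_component('a?b&c=d'), "\n"; # a%3Fb%26c%3Dd
```

This is meant for a single path segment or parameter value, not a whole URL: run over a complete URL it would also escape the "/", "?" and "&" that give the URL its structure.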
Re: utf8 urls
On Wed 19 Mar 2008, Eli Shemer wrote:
> For some reason the following test doesn’t print anything out to the screen
>
> Do I need to change something in the apache configuration, or mod_perl’s ?
>
> /articles_read.pl?id=חוזרת

This is probably a bug in libapreq2. I have tried this handler:

    sub {
      my $r=$_[0];
      $r->content_type('text/html; charset=UTF-8');
      my $x=Apache2::Request->new($r);
      $r->print("\nargs=".$r->args."\nparam(x)=".
                $x->param('x')."\n\n");
      return Apache2::Const::OK;
    }

http://localhost/test?x=חוזרת entered in FF changes on the fly into http://localhost/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA and it works. But on the command line with curl it doesn't:

    $ curl 'http://localhost/test?x=חוזרת' -v
    * About to connect() to localhost port 80 (#0)
    *   Trying 127.0.0.1... connected
    * Connected to localhost (127.0.0.1) port 80 (#0)
    > GET /test?x=חוזרת HTTP/1.1
    > User-Agent: curl/7.16.4 (i686-suse-linux-gnu) libcurl/7.16.4 OpenSSL/0.9.8e zlib/1.2.3 libidn/1.0
    > Host: localhost
    > Accept: */*
    >
    < HTTP/1.1 200 OK
    < Date: Wed, 19 Mar 2008 12:45:29 GMT
    < Server: Apache/2.2.6 (Unix) mod_ssl/2.2.6 OpenSSL/0.9.8e DAV/2 SVN/1.4.5 mod_apreq2-20051231/2.6.0 mod_perl/2.0.4-dev Perl/v5.8.8
    < Transfer-Encoding: chunked
    < Content-Type: text/html; charset=UTF-8
    <

    args=x=חוזרת
    param(x)=

    * Connection #0 to host localhost left intact
    * Closing connection #0

Torsten
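The two requests above differ only in what goes on the wire: Firefox sends percent-escapes, curl sends the raw UTF-8 octets. Below is a minimal sketch of how a handler might tell the two apart, assuming it can get at the request URI as unparsed octets. The helper name has_unescaped_octets is made up for illustration, and what to do with such a request (reject it, or try to repair it) is left open.

```perl
use strict;
use warnings;

# Hypothetical check: true if the URI, taken as octets, contains anything
# outside printable US-ASCII (space, control bytes, or bytes >= 0x7F).
sub has_unescaped_octets {
    my ($uri_octets) = @_;
    return $uri_octets =~ /[^\x21-\x7E]/ ? 1 : 0;
}

# What Firefox sent: escapes only, so nothing to complain about.
my $from_firefox = '/test?x=%D7%97%D7%95%D7%96%D7%A8%D7%AA';

# What curl sent: the same five letters as ten raw UTF-8 octets.
my $from_curl = "/test?x=\xD7\x97\xD7\x95\xD7\x96\xD7\xA8\xD7\xAA";

printf "firefox: %d, curl: %d\n",
    has_unescaped_octets($from_firefox),
    has_unescaped_octets($from_curl);
```

A check like this only says that the client did not follow the RFC; it does not explain why param(x) comes back empty for the curl request.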
Re: utf8 urls
John ORourke wrote:
> Eli Shemer wrote:
>> For some reason the following test doesn’t print anything out to the screen
>>
>> Do I need to change something in the apache configuration, or mod_perl’s ?
>>
>> /articles_read.pl?id=חוזרת
>>
>> ## get http parameters
>> $r = shift;
>> $apr = Apache2::Request->new($r);
>> print $apr->param('id');
>
> I'm not sure why you get nothing, but I can tell you strings read from Apache objects come through as octets and need to be decoded before use. We're using UTF-8 chars in URLs but I've never used one in a GET request parameter.

I can't say why it doesn't work, but I'm surprised it would in either case - the only characters explicitly allowed in a uri are us-ascii. from rfc2396:

    2.4. Escape Sequences

    Data must be escaped if it does not have a representation using an
    unreserved character; this includes data that does not correspond to
    a printable character of the US-ASCII coded character set, or that
    corresponds to any US-ASCII character that is disallowed, as
    explained below.

a bit of googling turned up this cpan module:

    http://search.cpan.org/dist/URI-Find-UTF8/lib/URI/Find/UTF8.pm

where the docs point to a ja.wikipedia.org page. for me (firefox 2.0) clicking on the "original" uri (the one with the japanese characters) opens up a uri with the uri-escaped character sequence. it's like magic ;)

anyway, my point wasn't to get into some huge debate on whether people are (successfully) using utf-8 characters in uris, etc. rather, it is that mod_perl is (mostly) merely a wrapper around apache, and if something is improper wrt an official rfc, apache generally dismisses it rather than bending to a behavior which people may be using anyway.

so, if it works, great. if not, try making your urls conform to 2396 and see if you have better results.

--Geoff
Re: utf8 urls
From a previous message by Adam Prime on this same list:

[...] SetHandler modperl doesn't bind 'print' to '$r->print'. Try SetHandler perl-script, or change your code to pass in the request object and use $r->print instead of print. [...]

Or, more verbosely and explicitly: if, in your Apache configuration for this "location", you used

    SetHandler modperl

then you should not assume that print() sends its output to the browser. But if you did (as you did)

    $r = shift; # get the Apache2::RequestRec object

then $r->print() does go back to the browser as the response. You should probably at least set a content-type header though, like

    $r->content_type('text/plain');
    $r->print($apr->param('id'));

In your case, it might also be a good idea to send back a header indicating which character set is used (presumably UTF-8), since the default HTTP character set is iso-8859-1 and the string you send back doesn't look printable in that charset. But I don't know exactly how best to do that in mod_perl. Would the following work?

    $r->content_type('text/plain; charset="UTF-8"');

Also, the previous message about how to handle your (apparently) UTF-8 request should be taken into account.

André
Re: utf8 urls
Eli Shemer wrote:
> For some reason the following test doesn’t print anything out to the screen
>
> Do I need to change something in the apache configuration, or mod_perl’s ?
>
> /articles_read.pl?id=חוזרת
>
> ## get http parameters
> $r = shift;
> $apr = Apache2::Request->new($r);
> print $apr->param('id');

I'm not sure why you get nothing, but I can tell you strings read from Apache objects come through as octets and need to be decoded before use. We're using UTF-8 chars in URLs but I've never used one in a GET request parameter.

hope that helps,
John
utf8 urls
Hey there

For some reason the following test doesn’t print anything out to the screen

Do I need to change something in the apache configuration, or mod_perl’s ?

/articles_read.pl?id=חוזרת

    ## get http parameters
    $r = shift;
    $apr = Apache2::Request->new($r);
    print $apr->param('id');

thanks in advance.