Joe Schaefer wrote:
> Tatsuhiko Miyagawa <[EMAIL PROTECTED]> writes:
> 
> > It seems that Apache's ap_unescape_url() can't handle %uXXXX style
> > URI-escaped Unicode strings, hence Apache::Request cannot either,
> > while CGI.pm can.
> 
> You may want to take this issue up on [EMAIL PROTECTED]
> Personally I've never seen this kind of character encoding, 
> and my reading of
> 
>   Section 8 at http://www.w3.org/TR/charmod/ 
>   and RFC 2718, Section 2.2.5, 
> 
> seems to indicate that this isn't a recommended practice. 

You didn't mention what the recommended practice actually
is - perhaps because everyone except me already knows.  Just
in case someone else is interested, but not interested enough
to try to decipher the recommendation documents, here's
my interpretation of what they say:

A URI which contains characters outside US-ASCII (code points
above 127) should first have those characters encoded as UTF-8,
and then each byte of the resulting sequence should be escaped
using the %XX hex encoding.  The unescaped form, with the
characters written directly, is what the documents call an
Internationalized Resource Identifier or 'IRI'.

For example, say you wanted to invoke a CGI script (obviously
you'd be using Apache::Registry for this message to be on-topic)
and pass it the Euro symbol in the 'cs' (currency symbol) 
parameter in the query string.

The Euro symbol is code point U+20AC in Unicode.
In UTF-8 encoding this is the three-byte sequence: 0xE2 0x82 0xAC
(see http://perl-xml.sourceforge.net/faq/#utf8 if you're not sure
why).  The resulting URI might look like this:

  http://127.0.0.1/bin/test.cgi?cs=%E2%82%AC
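
If you'd rather not work out the escapes by hand, the two steps
(UTF-8 encode, then %XX-escape each byte) can be sketched roughly
as below.  This assumes Perl 5.8 or later with the core Encode
module; utf8_uri_escape() is a name I made up for illustration,
and the set of characters it leaves unescaped (ASCII letters,
digits and - . _ ~) is deliberately conservative.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of any module): encode a character
# string as UTF-8, then %XX-escape every byte that is not an ASCII
# letter, digit, or one of - . _ ~
sub utf8_uri_escape {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $euro = "\x{20AC}";    # EURO SIGN
print "http://127.0.0.1/bin/test.cgi?cs=", utf8_uri_escape($euro), "\n";
```

Run as-is, this should print the same URI as above.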

CGI.pm will unescape that back into the original three bytes, and
because the page is served with charset=UTF-8 the browser shows
the Euro symbol, as you can see for yourself with this test script:

---------------------------------------------------------------
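#!/usr/bin/perl

use strict;
use warnings;
use utf8;    # only declares that this source file is UTF-8
use CGI;

my $cgi = CGI->new;

# CGI.pm turns each %XX escape back into a raw byte, so for the URI
# above $sym holds the three bytes 0xE2 0x82 0xAC.
my $sym = $cgi->param('cs');
$sym = '' unless defined $sym;

# Escape HTML metacharacters so the parameter cannot inject markup.
$sym = $cgi->escapeHTML($sym);

# The charset in the header tells the browser to read the bytes as UTF-8.
print <<EOF;
Content-type: text/html; charset=UTF-8

<html>
<body>
<p>$sym</p>
</body>
</html>
EOF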
---------------------------------------------------------------
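
As far as I can tell, what CGI.pm hands back by default is the raw
UTF-8 bytes rather than a Perl character string, which is fine if
all you do is print them back out.  If you need the actual
character (to check its code point, say), a rough sketch of the
reverse trip is below.  Again this assumes Perl 5.8 or later with
the core Encode module, and utf8_uri_unescape() is a made-up name
for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of any module): turn %XX escapes back
# into raw bytes, then decode those bytes as UTF-8 to get characters.
sub utf8_uri_unescape {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

my $sym = utf8_uri_unescape('%E2%82%AC');
printf "U+%04X\n", ord($sym);    # prints U+20AC
```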

Regards
Grant


