Re: IRI support in URI and URI::Escape modules

2005-01-31 Thread Dan Kogai
On Jan 31, 2005, at 18:19, Martin Duerst wrote:
> I started with some very simple (I thought) tests, but got
> completely confused very quickly. Here is the short program
> that I was using:
>  test.pl
> use utf8;
> use URI;
> use URI::Escape;
> print (uri_escape("\xFD")
> [snip]
> With this, on perl, v5.6.1 built for MSWin32-x86-multi-thread
> (with 1 registered patch, see perl -V for more detail), I get
>
> %FD
> %C3%BD
> [snip]
> However, on perl, v5.8.4 built for i386-linux-thread-multi,
> I get:
>
> %FD
> [snip]
> Nothing seems to work anymore, although (or because?) 5.8
> has better Unicode support.

The (easiest|new canonical) way to go is to use uri_escape_utf8()
instead of uri_escape().  Note that as of version 3.28
uri_escape_utf8() is NOT exported by default, so you have to
import it explicitly.

% perl -MURI::Escape -le 'print uri_escape("\xFD")'
%FD
% perl -MURI::Escape=uri_escape_utf8 -le 'print uri_escape_utf8("\xFD")'
%C3%BD
perldoc URI::Escape
   uri_escape_utf8( $string )
   uri_escape_utf8( $string, $unsafe )
   Works like uri_escape(), but will encode chars as UTF-8 before
   escaping them.  This makes this function able to deal with
   characters with code above 255 in $string.  Note that chars in
   the 128 .. 255 range will be escaped differently by this function
   compared to what uri_escape() would.  For chars in the 0 .. 127
   range there is no difference.

   The call:
       $uri = uri_escape_utf8($string);
   will be the same as:
       use Encode qw(encode);
       $uri = uri_escape(encode("UTF-8", $string));
   but will even work for perl-5.6 for chars in the 128 .. 255 range.
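
The equivalence stated in the perldoc excerpt can be checked with a
short script. This is only a sketch: it assumes a Perl 5.8 or later
with Encode, and a URI::Escape recent enough to provide
uri_escape_utf8(). The sample strings are picked for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
# Import explicitly, since uri_escape_utf8() may not be exported by default.
use URI::Escape qw(uri_escape uri_escape_utf8);

# Illustrative inputs: ASCII, a Latin-1 character, and a character above 255.
my @samples = ("abc", "\xFD", "\x{370}");

for my $string (@samples) {
    my $direct  = uri_escape_utf8($string);
    my $via_enc = uri_escape(encode("UTF-8", $string));
    # Both routes should give the same escaped form.
    printf "%-8s %s\n", $direct, ($direct eq $via_enc ? "same" : "differs");
}
```

Each line of output should end in "same" if the two calls really are
interchangeable on the installed versions.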
Dan the Encode Maintainer


IRI support in URI and URI::Escape modules

2005-01-31 Thread Martin Duerst
Dear Perl Unicode Experts,
I tried to have a look at how much would have to be done to get
the URI and URI::Escape modules to support IRIs in a reasonable
way. The IRI spec has just been published as an IETF Proposed
Standard at http://www.ietf.org/rfc/rfc3987.txt. Also, a new
version of the URI spec is now Internet Standard 66 and is
available at http://www.ietf.org/rfc/rfc3986.txt.
I'm looking for two things:
a) short-term, how to get IRI support using the above and maybe
   some additional modules
b) long-term, how to make these modules (and maybe others)
   work with IRIs as well as with the new URI spec
Support for these new specs mainly includes the following things:
1) Escaping with %hh is based on UTF-8, not some local character
   encoding
2) URIs now allow %hh in the host name part, and require that
   it is interpreted as UTF-8
3) IDNs (i.e. conversion to punycode, and if possible also
   nameprep/stringprep) should be supported
4) The user of e.g. the URI module should ideally only have to
   deal with one form of the URI/IRI, the one used to construct
   the URI/IRI, although it should be possible to create other
   forms (e.g. a fully %-encoded URI, an IRI that contains
   as few %hh as possible)
5) It should be possible to apply normalization operations
   as described in the IRI spec on different parts of a URI/IRI
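
To make points 1) and 4) concrete, here is a minimal sketch of going
from an IRI to a fully %-encoded URI by escaping only the non-ASCII
characters as UTF-8. The helper name iri_to_uri() and the example.org
address are made up for illustration. The sketch deliberately ignores
IDN handling of the host part and everything else in the spec, so it
is not a complete IRI-to-URI mapping.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of URI or URI::Escape): percent-encode
# every non-ASCII character of an IRI, based on its UTF-8 octets.
sub iri_to_uri {
    my ($iri) = @_;
    # Characters -> UTF-8 octets; ASCII stays as it is.
    my $octets = encode("UTF-8", $iri);
    # Escape each octet outside the ASCII range as %HH.
    $octets =~ s/([\x80-\xFF])/sprintf("%%%02X", ord($1))/ge;
    return $octets;
}

print iri_to_uri("http://example.org/\x{370}\x{FD}"), "\n";
# prints http://example.org/%CD%B0%C3%BD
```

Characters that are already ASCII, including existing %hh sequences,
pass through untouched.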
I started with some very simple (I thought) tests, but got
completely confused very quickly. Here is the short program
that I was using:
 test.pl
use utf8;
use URI;
use URI::Escape;
print (uri_escape("\xFD") . "\n");
print (iri_escape("\xFD") . "\n");
print (uri_escape("\x{FD}") . "\n");
print (iri_escape("\x{FD}") . "\n");
print (uri_escape("\x{370}") . "\n");
print (iri_escape("\x{370}") . "\n");
sub iri_escape
{
return substr (uri_escape("\x{370}".shift), 6);
}

With this, on perl, v5.6.1 built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail), I get

%FD
%C3%BD
%C3%BD
%C3%BD
%CD%B0
%CD%B0

which seems to show that the trick with adding a non-Latin-1
character and then removing its escaped form works (compare
the first line to the second line).
However, on perl, v5.8.4 built for i386-linux-thread-multi,
I get:

%FD
%FD


Nothing seems to work anymore, although (or because?) 5.8
has better Unicode support.
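
For comparison, the same three inputs can be run through
uri_escape_utf8(), assuming an installed URI::Escape that provides it.
This is a sketch of an alternative to the iri_escape() trick, not a
statement about what the modules should do for IRIs.

```perl
use strict;
use warnings;
# Import explicitly; older releases do not export uri_escape_utf8() by default.
use URI::Escape qw(uri_escape_utf8);

# "\xFD" and "\x{FD}" are the same one-character string (U+00FD);
# "\x{370}" is a character above 255.
for my $char ("\xFD", "\x{FD}", "\x{370}") {
    print uri_escape_utf8($char), "\n";
}
# prints:
# %C3%BD
# %C3%BD
# %CD%B0
```

No prefix character has to be added and stripped again, because the
string is encoded to UTF-8 before it is escaped.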
Any help appreciated.
Regards, Martin.