On Wed, Apr 13, 2005 at 09:52:41AM -0400, Stevan Little wrote:
> On Apr 13, 2005, at 9:20 AM, BÁRTHÁZI András wrote:
> >As Pugs works in UTF-8, my page is encoded in UTF-8, too (and there are
> >some other reasons, too). When I try to send an accented character to
> >the server as a parameter, for example the euro character, I get back a
> >UTF-8 encoded character:
> >
> > ...?test=%E2%82%AC
> >
> >That's OK, but when my code (and CGI.pm as well) tries to decode it, it
> >gives back three characters instead of just one.
> >
> >The problem is with this line in sub url_decode():
> >
> > $decoded ~~ s:perl5:g/%([\da-fA-F][\da-fA-F])/{chr(hex($1))}/;
> >
> >Does anyone have an idea how to solve it? I think I should change this
> >code to recognize multi-byte sequences, decode the character value, and
> >then call chr on that value. Or is there a way to create not a
> >character with chr(), but a byte with another function?
>
> To be honest, my experience with multi-byte character sets is very
> limited (my first real exposure is on the Pugs project). However, I
> think/hope that maybe the chr() builtin will eventually be able to
> handle multi-bytes itself. In the (non-working) port of CGI-Lite
> (http://tpe.freepan.org/repos/iblech/CGI-Lite/lib/CGI/Lite.pm), I saw
> code which did this:
>
> /%(<[\da-fA-F]>**{2})/{chr :16($1)}/
>
> Of course it was followed by this comment "# XXX -- correct?" so it may
> not be anything official yet.
The trick is that URL encoding encodes bytes, not characters:
http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars
So in the regex we have to determine whether we are decoding a
single-byte or a multi-byte character.
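
One way to sidestep that decision is to decode in two passes: first turn
every %XX escape into a raw byte, then interpret the whole byte string as
UTF-8 in one go. The following is only a rough sketch of that idea, written
in Python rather than Perl 6 so it can be run today. The function name
url_decode is borrowed from the original post; its body here, and the
assumption that the escaped bytes are always UTF-8, are illustrative and not
taken from CGI.pm or CGI-Lite.

```python
import re


def url_decode(text: str) -> str:
    """Percent-decode `text`, assuming the escaped bytes are UTF-8."""
    # Work on bytes so that each %XX escape maps to exactly one byte.
    raw = text.encode("utf-8")
    # Pass 1: replace every %XX escape with the single byte it names.
    unescaped = re.sub(
        rb"%([0-9a-fA-F]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        raw,
    )
    # Pass 2: decode the complete byte string as UTF-8 (assumption).
    # Invalid byte sequences raise UnicodeDecodeError here.
    return unescaped.decode("utf-8")


print(len(url_decode("%E2%82%AC")))  # 1
```

With this ordering the regex never needs to know how long a character is;
the UTF-8 decoder in the second pass handles that. The sketch leaves out the
'+' to space translation used for form data.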
Both
s:perl5:g/%([\da-fA-F][\da-fA-F])/{chr(hex($1))}/
and
/%(<[\da-fA-F]>**{2})/{chr :16($1)}/
read in a single byte and pass it to chr(). I do not have enough
experience with multi-byte characters to know when a byte can be
recognized as the first byte of a multi-byte character, and thus when
to grab the following bytes before passing the result to chr().
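
For what it is worth, in UTF-8 the first byte of a sequence announces the
sequence length through its high bits, and every following byte lies in the
range 0x80-0xBF. A minimal Python sketch of that rule is below. The helper
name utf8_sequence_length is made up for illustration; the byte ranges follow
the UTF-8 definition in RFC 3629, and the sketch assumes the data really is
UTF-8.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with `lead` has."""
    if lead <= 0x7F:
        return 1  # 0xxxxxxx: plain ASCII, a single byte
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx: one continuation byte follows
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx: two continuation bytes follow
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx: three continuation bytes follow
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF never
    # appear in valid UTF-8.
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


print(utf8_sequence_length(0xE2))  # 3
```

For the euro example, %E2 is the lead byte and reports a length of three, so
%82 and %AC belong to the same character.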
-kolibrie