I think there might be a problem with _normalize_path, from HTTP::Cookies.
I'll explain what happens in terms of my Python port, since I have no idea
how Perl and Unicode interact: a unicode URI got passed to my equivalent of
_normalize_path() (a unicode string is a separate type from an ordinary
byte-string in Python).  That function complained because there were
non-ASCII characters in the unicode string, and it refused to guess which
encoding to use.
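
Roughly, the check looks like this (a simplified Python 2 sketch, not my
actual code; the function name just mirrors the Perl one):

    def normalize_path(path):
        # %-escapes are defined on octets, so escaping the non-ASCII
        # characters in a unicode string requires choosing a character
        # encoding: refuse to guess one.
        if isinstance(path, unicode):
            path = path.encode("ascii")  # raises UnicodeError if non-ASCII
        # (escaping of unsafe octets etc. would follow here)
        return path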

The stated purpose of _normalize_path is to allow plain string-comparison
of HTTP URI paths, but I don't understand a) how that's possible given
that the URI character set isn't always known, and b) why it's necessary
-- why not just compare without any normalization?
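
To be fair, I can see that equivalent spellings of the same path compare
unequal as raw strings, which I assume is what the normalization is meant
to fix (Python 2 again):

    # RFC 2396 treats all of these as the same path, but plain
    # (byte-)string comparison does not:
    print "/%7euser/" == "/%7Euser/"  # False: hex-digit case differs
    print "/%7Euser/" == "/~user/"    # False: escaped vs. unescaped

But that only motivates normalizing escapes of ASCII characters, where the
octets are unambiguous.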

The trouble is, RFC 2396 doesn't specify any URI character encoding, but
allows %-escapes, which are defined in terms of octets.  So, when you see
a URI containing %-escaped chars, you have to know the original URI
character encoding in order to work out what characters they represent.
Unfortunately, I don't think that's always possible (is it?), so
normalizing to "fully-escaped" form (as _normalize_path does) may involve
assuming a different encoding than was used to partially escape the URI
before HTTP::Cookies had anything to do with it.  Escaping with
inconsistent character encodings certainly seems bad.
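
To make that concrete in Python 2 (the accented character is just an
example):

    import urllib
    # One character, two encodings, two different %-escaped forms:
    print urllib.quote(u"\u00e9".encode("utf-8"))    # %C3%A9
    print urllib.quote(u"\u00e9".encode("latin-1"))  # %E9

Given only "%E9" on the wire, there's no way to tell which character was
meant.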

Am I correct?  Why not just leave URIs un-normalized?  If they must be
normalized, how should unicode URIs (or non-ASCII ones, generally) get
normalized?

This is all very confusing, especially to an English speaker who never
reads or writes anything but ASCII!


John