Hi.

This whole question is a pain in the a.., and I personally do not understand how a million marketing people can talk about "web 2.0" and "web 3.0" while nobody has managed to come out with an HTTP 2.0 in which URLs (and everything else) would be Unicode/UTF-8 by default instead of ASCII and/or ISO-Latin-1.

But things being what they are, to answer your question to the best of my abilities, while trying to avoid jargon and twisted language:

Basically, a URL "in transit" between a client and a server should contain only *bytes* with individual values between 0 and 127 decimal. So when it is about to send a URL to a server, a client should examine the URL byte by byte, and replace any byte outside the 0-127 range by the 3-byte sequence %xy, where xy is the hexadecimal representation of the byte value. On top of that, there are additional rules for some of the bytes in the 0-127 range: some are forbidden in a URL altogether, some must also be encoded with the %xy logic, and some are encoded differently depending on where they appear (in a form-encoded query string a space is sent as "+" and a literal "+" as %2B; for the ";", see Konstantin's explanation quoted below).
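
To make the client side concrete, here is a minimal sketch in Java. It rests on two assumptions that are mine and not something the RFCs impose: the client picks UTF-8 to turn characters into bytes, and it takes the conservative route of escaping every byte that is not an RFC 3986 "unreserved" character. The class and method names are made up for the illustration.

```java
import java.nio.charset.StandardCharsets;

final class PercentEncodeSketch {
    // RFC 3986 "unreserved" characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
    private static final String UNRESERVED =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    /** Percent-encodes one path segment. Assumption: the client uses UTF-8. */
    static String encodeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (byte b : segment.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF; // unsigned value of the byte, 0..255
            if (v < 128 && UNRESERVED.indexOf((char) v) >= 0) {
                out.append((char) v); // safe to send as-is
            } else {
                // everything else, including all bytes above 127, becomes %xy
                out.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0F]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "path" followed by the four Lithuanian letters from the question
        System.out.println(encodeSegment("path\u0105\u010D\u0119\u0117"));
        // prints path%C4%85%C4%8D%C4%99%C4%97
    }
}
```

Escaping more than strictly required is harmless for the server, at the price of less readable URLs. A client that had chosen ISO-8859-1 or some other character set instead would put different %xy sequences on the wire for the same characters.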

At the server side, the first thing the server should do with this URL is the inverse translation: examine the URL and replace each %xy sequence by the single byte value which it represented in transit (and, in a form-encoded query string, "+" by a space).
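
The inverse step can be sketched like this, again in Java and again with made-up names. Note that the result is deliberately an array of *bytes* and not a String: at this stage the server has no idea yet which character set those bytes are in. The plusAsSpace flag is my own addition, to keep the "+" rule confined to form-encoded query strings.

```java
import java.io.ByteArrayOutputStream;

final class PercentDecodeSketch {
    /** Turns each %xy back into one byte; returns raw bytes, not characters. */
    static byte[] decodeToBytes(String encoded, boolean plusAsSpace) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                if (i + 2 >= encoded.length()) {
                    throw new IllegalArgumentException("truncated %xy at index " + i);
                }
                int hi = Character.digit(encoded.charAt(i + 1), 16);
                int lo = Character.digit(encoded.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad hex digits at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 2; // skip the two hex digits just consumed
            } else if (c == '+' && plusAsSpace) {
                out.write(' ');
            } else if (c < 128) {
                out.write(c);
            } else {
                // a URL in transit should only contain byte values 0..127
                throw new IllegalArgumentException("non-ASCII character at index " + i);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = decodeToBytes("path%C4%85", false);
        System.out.println(raw.length); // 6 bytes: p, a, t, h, 0xC4, 0x85
    }
}
```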

And /then/ starts the circus.

Because there is nothing in the RFCs that would enable the server to know, after this URL-decoding, in which character set the client expressed this URL.

So basically, the interpretation of at least part of the URL falls to the server-side application, and the client is supposed to send "the right thing" so that the application does not get confused. And there is no real way for the server to force the client to do the right thing. And if either side does not respect whatever convention they have between them, one of the sides will get confused.
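
A small Java illustration of why this matters (the names are mine, and the two character sets are just the usual suspects): the very same two bytes, as they would arrive from "%C3%A9" after URL-decoding, give two different strings depending on which character set the server-side code assumes.

```java
import java.nio.charset.StandardCharsets;

final class CharsetAmbiguitySketch {
    static String asLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    static String asUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // the bytes the server has in hand after decoding "%C3%A9"
        byte[] raw = { (byte) 0xC3, (byte) 0xA9 };
        // read as UTF-8: one character, U+00E9 (e with acute accent)
        System.out.println(asUtf8(raw).length());   // prints 1
        // read as ISO-8859-1: two characters, U+00C3 followed by U+00A9
        System.out.println(asLatin1(raw).length()); // prints 2
    }
}
```

Neither reading is "wrong" as far as the bytes are concerned. Only an agreement between the client and the application settles which one was meant.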

To my knowledge, there exists no Internet RFC which contradicts what I am writing above.
It is a definite hole in the specs, and one which nowadays costs a lot of time lost to confusion and half-way patching attempts (*). I can understand that when HTTP 1.0 was first defined, some 15 years ago now, this was a perfectly valid position to take. But I personally do not understand why, 15 years and 100 million webservers worldwide later, and now that Unicode/UTF-8 support is ubiquitous, we are still at the same point.


(*) such as IE's "always send URLs as UTF-8" and Tomcat's "useBodyEncodingForURI" hacks.




Mindaugas Žakšauskas wrote:
On Mon, May 9, 2011 at 2:03 PM, Konstantin Kolinko
<knst.koli...@gmail.com> wrote:
<..>
If ";" is part of the actual path, it must be escaped.

If ";" starts a "path parameter" it must be unescaped. One well-known
example is ";jsessionid" path parameter.

Thanks for your answer. Is this rule just a "de facto" rule, or is it
documented anywhere in RFC3986/RFC2396?

Extending my question, is there a clear criterion which would define
which characters always need escaping and which don't? At the moment I
am escaping everything that is not unreserved [1], but I am not sure
about SEOability and user-friendliness - this especially concerns path
with international characters in URLs, e.g. http://site/pathąčęė

I have also found a similar Tomcat bug [2], but it is addressing
slightly different issue.

If anyone wants to benefit from this, I have just added 50 bonus points
to my SO question [3]. The main question I want to get an answer to is:
which characters can be left as they are and which need escaping, both
in terms of the RFC and of Tomcat.

Regards,
Mindaugas

1. According to RFC 3986, unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
2. https://issues.apache.org/bugzilla/show_bug.cgi?id=51132
3. http://stackoverflow.com/questions/5913623/rfc3986-which-pchars-need-to-be-percent-encoded

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org



