Re: Semicolon URI encoding and RFC

André Warnier Mon, 09 May 2011 08:12:06 -0700

Hi.

This whole question is a pain in the a.. , and I personally do not understand how a million marketing people can be talking of "web 2.0" and "web 3.0", but not have been able to come out with HTTP 2.0 where URLs (and everything else) would be by default Unicode/UTF-8 instead of ASCII and/or ISO-latin-1.

But things being what they are, to answer your question to the best of my abilities, and trying to avoid jargon and twisted language:

Basically, a URL "in transit" between a client and a server should contain only *bytes* with individual byte values between 0 and 127 decimal.
Thus when it is about to send a URL to a server, any client should examine the URL byte-by-byte, and if any of these bytes would be outside the 0-127 range, it should replace it by a 3-byte sequence %xy, where xy is the hexadecimal representation of the byte value.
And then there are some additional rules for some of the bytes 0-127, which either forbid them in a URL, or also specify that you have to encode them with the %xy logic, or differently (like a space encoded as a "+" and a "+" encoded as %xy in form-encoded query strings), and/or when (as Konstantin explains below for the ";").
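As an illustration of the client-side rule just described, here is a minimal Python sketch. The helper name `percent_encode` and the sample path `/café` are made up for this example, and the function deliberately covers only the "bytes outside 0-127" rule, not the additional rules for reserved ASCII characters.

```python
def percent_encode(url_bytes: bytes) -> str:
    """Replace every byte above 127 by a 3-character %XY sequence.

    Hypothetical helper for illustration only: bytes in the 0-127
    range are passed through untouched, which a real client must
    not do for reserved or forbidden ASCII characters.
    """
    out = []
    for b in url_bytes:
        if b > 127:
            out.append("%{:02X}".format(b))  # hexadecimal value of the byte
        else:
            out.append(chr(b))
    return "".join(out)


path = "/café"
# The same text gives different bytes, hence different URLs,
# depending on the character set the client started from.
print(percent_encode(path.encode("utf-8")))       # /caf%C3%A9
print(percent_encode(path.encode("iso-8859-1")))  # /caf%E9
```

In practice Python's standard `urllib.parse.quote` does this job and also handles the ASCII rules; it assumes UTF-8 unless told otherwise.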

At the server side, the first thing which the server should do with this URL is to make the inverse translation: examine the URL and replace any %xy sequence by the single byte value which this sequence represented in transit (and, in form-encoded query strings, "+" by a space).
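The inverse translation can be sketched with Python's standard `urllib.parse` module. The sample strings are invented for the example. Note that the result of the first step is a sequence of bytes, not yet text.

```python
from urllib.parse import unquote_to_bytes, unquote_plus

# Step 1: turn each %xy sequence back into the single byte it stood for.
raw_path = unquote_to_bytes("/caf%C3%A9")
print(raw_path)  # b'/caf\xc3\xa9'

# In form-encoded query strings, "+" also stands for a space,
# while a literal "+" arrives as %2B.
query_value = unquote_plus("a+b%2Bc")
print(query_value)  # a b+c
```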


And /then/ starts the circus.

Because there is nothing in the RFCs that would enable the server to know, after this URL-decoding, in which character set the client expressed this URL.

So basically, the interpretation of at least part of the URL falls to the server-side application, and the client is supposed to send "the right thing" so that the application does not get confused. And there is no real way for the server to force the client to do the right thing.
And if either side does not respect whatever convention they have between them, one of the sides will get confused.
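To make the confusion concrete, here is a small Python sketch in which a server guesses the character set of an already URL-decoded path. The sample URLs are invented, and the two candidate charsets (UTF-8 and ISO-8859-1) are simply the ones discussed in this message.

```python
from urllib.parse import unquote_to_bytes

# A client that started from UTF-8 sends this:
raw = unquote_to_bytes("/caf%C3%A9")

# The server has to guess the charset; both guesses "work",
# but only one gives the text the client meant.
as_utf8 = raw.decode("utf-8")         # '/café'
as_latin1 = raw.decode("iso-8859-1")  # '/cafÃ©'
print(as_utf8, as_latin1)

# A client that started from ISO-8859-1 sends this instead:
raw_latin1 = unquote_to_bytes("/caf%E9")
try:
    raw_latin1.decode("utf-8")
    utf8_guess_failed = False
except UnicodeDecodeError:
    # 0xE9 alone is not a valid UTF-8 sequence.
    utf8_guess_failed = True
print(utf8_guess_failed)  # True
```

Nothing in the request itself tells the server which of the two decodings to apply, which is why both sides need a convention agreed outside the protocol.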


To my knowledge, there exists no Internet RFC which contradicts what I am 
writing above.

It is a definite hole in the specs, and one which nowadays is costing a lot of time being lost in confusion and half-way patching attempts (*).
I can understand that when HTTP 1.0 was first defined 15 years ago now, this was a perfectly valid position to take. But I personally do not understand why nowadays, 15 years and 100 million worldwide webservers later, and now that Unicode/UTF-8 support is ubiquitous, we are still at the same point.



(*) such as IE's "always send URLs as UTF-8", and Tomcat's 
"useBodyEncodingForURI" hacks.




Mindaugas Žakšauskas wrote:

On Mon, May 9, 2011 at 2:03 PM, Konstantin Kolinko
<knst.koli...@gmail.com> wrote:
<..>

If ";" is part of the actual path, it must be escaped.

If ";" starts a "path parameter" it must be unescaped. One well-known
example is ";jsessionid" path parameter.


Thanks for your answer. Is this rule just a "de facto" rule, or is it
documented anywhere in RFC3986/RFC2396?

Extending my question, are there clear criteria which would define
which characters always need escaping and which don't? At the moment I
am escaping everything that is not unreserved [1], but I am not sure
about SEOability and user-friendliness - this especially concerns path
with international characters in URLs, e.g. http://site/pathąčęė

I have also found a similar Tomcat bug [2], but it is addressing
a slightly different issue.

If anyone wants to benefit from this, I have just added 50 bonus points to
my SO question [3]. The main question I want to get an answer to is -
which characters can and which need escaping, both in terms of RFC and
Tomcat.

Regards,
Mindaugas

1. According to RFC 3986, unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
2. https://issues.apache.org/bugzilla/show_bug.cgi?id=51132
3. http://stackoverflow.com/questions/5913623/rfc3986-which-pchars-need-to-be-percent-encoded

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org


