On Fri, 2008-01-18 at 09:47 -0500, Tim Julien wrote:
> All,
> 
> I've spent a few days looking into some strange URL encoding issues in
> HttpClient 4.0 alpha 2.  I'll describe what I've found; hopefully I am
> thinking about this correctly.
> 
> I think there is a regression from 3.0 -> 4.0 due to the use of java.net.URI.
> 

That was to be expected. The sole reason for not porting the URI class
from HttpClient 3.x and using java.net.URI instead is that the 3.x URI
code is a horrible mess no one wants to maintain, even though it
arguably has a more flexible API.


> On the old Commons HttpClient stack, we encoded URLs using
> java.net.URLEncoder, and passed them to the
> org.apache.commons.httpclient.URI() constructors.  Those constructors
> had a boolean parameter that indicated whether the URL was already encoded.
> 
> On the new 4.0 stack, java.net.URI is used instead - and apparently it
> has some strange encoding behavior.  For starters, you cannot specify
> whether the URL is encoded.  Instead, URIs constructed with the
> single-arg constructor are treated as encoded, while URIs constructed
> with the multi-arg constructors are treated as unencoded.  When using
> the multi-arg constructors, java.net.URI will perform encoding for you.
> 
> example:
> uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon doe", null);
> 
> uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon%20doe
> 
> This is correct (the space is encoded to %20).
> 
> The trouble comes with certain characters that RFC 2396 (the URI RFC)
> designates as "reserved".  "Reserved" characters are those that help
> give URIs their structure:
> 
> reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
>               "$" | ","
> 
> Those characters are also allowed to be used in a non-reserved fashion - 
> for example as values within a query string.  In such cases, you are 
> required to URL encode them, effectively "escaping" them.
> 
> And it seems that the multi-arg constructors, which do URL encoding for 
> you, do NOT provide a way for you to encode these characters - which 
> means you can only ever use them for their reserved (unescaped) purpose.
> 
> For example, suppose I want to produce this URL:
> 
> http://foo.com/bar?a=b&c=jon%26doe
> 
> // %26 is the encoded value of &
> // %25 is the encoded value of %
> 
> uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon%26doe", null);
> uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon%2526doe
> 
> // java.net.URI encodes the incoming "%" as %25
> 
> uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon&doe", null);
> uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon&doe
> 
> // java.net.URI has no way of knowing that the un-escaped "&" is
> // actually part of a value in the URI
> 
> The upshot of all of this is that I claim the multi-arg constructors are
> unusable, unless you restrict your URLs to never use reserved
> characters as values.  In our use case, we can't do that because we
> don't control what URLs are incoming / outgoing.
> 
> (Note that I can produce the desired URIs if I use the single-arg
> constructor and do all of the encoding myself beforehand.)
> 
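The behaviour described in the quoted examples can be reproduced with a small, self-contained sketch. The class and method names (UriEncodingDemo, viaMultiArg, viaSingleArg, encodeValue) are made up for illustration; only java.net.URI and java.net.URLEncoder come from the JDK. Note that URLEncoder performs form encoding, so it turns a space into "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

class UriEncodingDemo {
    // Multi-arg constructor: java.net.URI quotes illegal characters itself
    // and always quotes '%', so the query is treated as un-encoded.
    static String viaMultiArg(String query) {
        try {
            return new URI("http", null, "foo.com", -1, "/bar", query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Single-arg form: the query is taken as already encoded.
    static String viaSingleArg(String encodedQuery) {
        return URI.create("http://foo.com/bar?" + encodedQuery).toASCIIString();
    }

    // Form-encode a single parameter value before it is placed in the query.
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // http://foo.com/bar?a=b&c=jon%20doe
        System.out.println(viaMultiArg("a=b&c=jon doe"));
        // http://foo.com/bar?a=b&c=jon%2526doe  ('%' quoted again)
        System.out.println(viaMultiArg("a=b&c=jon%26doe"));
        // http://foo.com/bar?a=b&c=jon&doe      ('&' left as a separator)
        System.out.println(viaMultiArg("a=b&c=jon&doe"));
        // http://foo.com/bar?a=b&c=jon%26doe    (value encoded by the caller)
        System.out.println(viaSingleArg("a=b&c=" + encodeValue("jon&doe")));
    }
}
```

Only the last call yields the URL with "%26" in it, which matches the workaround of encoding each value yourself and using the single-arg constructor.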

Here's my take. There is nothing wrong with java.net.URI as such. It
just needs a better parser that can deal with escaped and unescaped
queries, as well as be more lenient about common non-compliant
behaviors, and then construct the URI instance using a multi-arg
constructor. It has long been on my virtual to-do list to open a feature
request for pluggable URI parsers in JIRA. Probably it is about time.
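
To make the idea of a more lenient parser concrete, here is a rough sketch of one possible heuristic, not a design for the proposed feature. LenientQueryParser, normalizeQuery and build are hypothetical names. The sketch keeps well-formed %XX escapes as they are and percent-encodes anything that is not legal in a query, including a stray "%". It hands the result to URI.create (the single-arg form) because the multi-arg constructors always quote "%"; how a real pluggable parser would construct the instance is left open.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

class LenientQueryParser {
    // RFC 2396 mark and reserved characters, which may appear unescaped in a query.
    private static final String LEGAL = "-_.!~*'()" + ";/?:@&=+$,";

    private static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    private static boolean isAsciiAlnum(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Accepts a query that may be escaped, unescaped, or a mix of both.
    static String normalizeQuery(String query) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < query.length()) {
            char c = query.charAt(i);
            if (c == '%' && i + 2 < query.length()
                    && isHex(query.charAt(i + 1)) && isHex(query.charAt(i + 2))) {
                out.append(query, i, i + 3); // keep an existing escape untouched
                i += 3;
            } else if (isAsciiAlnum(c) || LEGAL.indexOf(c) >= 0) {
                out.append(c);               // legal as-is
                i++;
            } else {
                // Anything else (space, stray '%', non-ASCII) becomes UTF-8 %XX.
                int cp = query.codePointAt(i);
                String s = new String(Character.toChars(cp));
                for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
                i += Character.charCount(cp);
            }
        }
        return out.toString();
    }

    // 'base' is assumed to be a valid URI string without a query part.
    static URI build(String base, String query) {
        return URI.create(base + "?" + normalizeQuery(query));
    }
}
```

With this heuristic, build("http://foo.com/bar", "a=b&c=jon%26doe") keeps the "%26", while "a=b&c=jon doe" becomes "a=b&c=jon%20doe". The obvious limitation is that it cannot tell whether a literal "%26" in the input was meant as an escape or as three plain characters, and it still cannot escape a bare "&" inside a value.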

Would that work for LimeWire?

Oleg

> This ends up being a problem on HttpClient 4.0, because the URI passed
> in is reconstructed a few times under the covers by HttpClient, using
> the multi-arg constructors.  I believe that the multi-arg constructors
> have to be replaced with single-arg constructors.
> 
> -Tim Julien
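
The effect of reconstructing a URI through the multi-arg constructors can be illustrated with the sketch below. It does not reproduce HttpClient's internal code; UriRebuildDemo, rebuildDecoded and rebuildRaw are hypothetical names, and rebuildRaw assumes an absolute, hierarchical URI with an authority part.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UriRebuildDemo {
    // Rebuild from the decoded components with the multi-arg constructor.
    // getQuery() has already turned "%26" into "&", and the constructor
    // does not escape "&" again, so the original escaping is lost.
    static URI rebuildDecoded(URI u) {
        try {
            return new URI(u.getScheme(), u.getUserInfo(), u.getHost(), u.getPort(),
                           u.getPath(), u.getQuery(), u.getFragment());
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Rebuild from the raw (still encoded) components and parse the
    // resulting string, which leaves existing escapes alone.
    static URI rebuildRaw(URI u) {
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme()).append("://").append(u.getRawAuthority());
        sb.append(u.getRawPath());
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        if (u.getRawFragment() != null) {
            sb.append('#').append(u.getRawFragment());
        }
        return URI.create(sb.toString());
    }

    public static void main(String[] args) {
        URI original = URI.create("http://foo.com/bar?a=b&c=jon%26doe");
        // http://foo.com/bar?a=b&c=jon&doe
        System.out.println(rebuildDecoded(original).toASCIIString());
        // http://foo.com/bar?a=b&c=jon%26doe
        System.out.println(rebuildRaw(original).toASCIIString());
    }
}
```

The first rebuild silently changes the meaning of the query (one parameter "c=jon&doe" becomes "c=jon" plus a bare "doe"), while the second round-trips unchanged.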
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

