On Fri, 2008-01-18 at 09:47 -0500, Tim Julien wrote:
> All,
>
> I've spent a few days looking into some strange URL encoding issues on
> http client 4.0 alpha 2.  I'll describe some things I've found,
> hopefully I am thinking about this correctly.
>
> I think there is a regression from 3.0 -> 4.0 due to the use of
> java.net.URI
>
That was to be expected. The sole reason for not porting the URI class
from HttpClient 3.x and using java.net.URI instead is that the old URI
code is a horrible mess no one wants to maintain, even though it
arguably has a more flexible API.

> On the old commons http client stack, we encoded URLs using
> java.net.URLEncoder, and passed them to the
> org.apache.commons.httpclient.URI() constructors.  Those constructors
> had a boolean parameter that indicated whether the url was encoded.
>
> On the new 4.0 stack, java.net.URI is used instead - and apparently it
> has some strange encoding behavior.  For starters, you cannot specify
> whether the URL is encoded.  Instead - URIs constructed with the
> single-arg constructor are treated as encoded - while URIs constructed
> with the multi-arg constructors are treated as un-encoded.  When using
> the multi-arg constructors, java.net.URI will perform encoding for you.
>
> example:
> uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon doe", null);
>
> uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon%20doe
>
> This is correct (the space is encoded to %20).
>
> The trouble comes with certain characters that the URL RFC 2396
> designates as "reserved".  "Reserved" characters are those that help
> give URIs their structure:
>
> reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
>               "$" | ","
>
> Those characters are also allowed to be used in a non-reserved fashion -
> for example as values within a query string.  In such cases, you are
> required to URL encode them, effectively "escaping" them.
>
> And it seems that the multi-arg constructors, which do URL encoding for
> you, do NOT provide a way for you to encode these characters - which
> means you can only ever use them for their reserved (unescaped) purpose.
>
> For example, suppose I want to produce this URL:
>
> http://foo.com/bar?a=b&c=jon%26doe
>
> // %26 is the encoded value of &
> // %25 is the encoded value of %
>
> uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon%26doe", null);
> uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon%2526doe
>
> // java.net.URI encodes the incoming "%" as %25
>
> uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon&doe", null);
> uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon&doe
>
> // java.net.URI has no way of knowing that the un-escaped "&" is
> // actually a value in the URI
>
> The upshot of all of this is that I claim the multi-arg constructors are
> unusable, unless you restrict your URLs to never use reserved
> characters as values.  In our use case, we can't do that because we
> don't control what URLs are incoming / outgoing.
>
> (Note that I can produce the desired URIs, if I use the single-arg
> constructor and do all of the encoding myself beforehand)
>

Here's my take. There is nothing wrong with java.net.URI as such. It
just needs a better parser in front of it, one that can deal with both
escaped and unescaped queries, is more lenient about common
non-compliant behaviors, and then constructs the URI instance using a
multi-arg constructor.

It has long been on my virtual to-do list to open a feature request in
JIRA for pluggable URI parsers. Probably it is about time.

Would that work for LimeWire?

Oleg

> This ends up being a problem on http client 4.0, because the URI passed
> in is reconstructed a few times under the covers by http client - using
> the multi-arg constructors.  I believe that the multi-arg constructors
> have to be replaced with single-arg constructors.
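For reference, the behavior described above can be reproduced with a
small standalone program. This is only a sketch under stated
assumptions: the class and method names (UriEncodingDemo, viaMultiArg,
viaSingleArg, encodeValue) are made up for illustration and are not
part of HttpClient, and UTF-8 is assumed for the query values. Note
also that java.net.URLEncoder produces form encoding, so it would turn
a space into "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demo class, not part of HttpClient.
class UriEncodingDemo {

    // Multi-arg constructor: java.net.URI quotes illegal characters itself
    // and always quotes '%', so the query is treated as unescaped input.
    static String viaMultiArg(String query) {
        try {
            return new URI("http", null, "foo.com", -1, "/bar", query, null)
                    .toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Single-arg constructor: the string is parsed as given, so existing
    // escapes such as %26 are kept as they are.
    static String viaSingleArg(String rawQuery) {
        try {
            return new URI("http://foo.com/bar?" + rawQuery).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Escapes a single query value before it is joined into the query string.
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Space is quoted as expected.
        System.out.println(viaMultiArg("a=b&c=jon doe"));
        // Pre-escaped '&' gets double-encoded: %26 becomes %2526.
        System.out.println(viaMultiArg("a=b&c=jon%26doe"));
        // Unescaped '&' is legal in a query and is left alone.
        System.out.println(viaMultiArg("a=b&c=jon&doe"));
        // Encoding the value first and using the single-arg constructor
        // yields the URL the original poster wanted.
        System.out.println(viaSingleArg("a=b&c=" + encodeValue("jon&doe")));
    }
}
```

The last call corresponds to the workaround mentioned in the quoted
text: encode each value up front, then hand the finished string to the
single-arg constructor.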
>
> -Tim Julien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]