All,

I've spent a few days looking into some strange URL encoding issues on
http client 4.0 alpha 2. I'll describe what I've found; hopefully I am thinking about this correctly.

I think there is a regression from 3.0 -> 4.0 due to the use of java.net.URI.

On the old commons http client stack, we encoded URLs using
java.net.URLEncoder, and passed them to the
org.apache.commons.httpclient.URI() constructors.  Those constructors
had a boolean parameter that indicated whether the URL was already encoded.

On the new 4.0 stack, java.net.URI is used instead, and apparently it
has some strange encoding behavior.  For starters, you cannot specify
whether the URL is encoded.  Instead, URIs constructed with the
single-arg constructor are treated as encoded, while URIs constructed
with the multi-arg constructors are treated as un-encoded. When using the multi-arg constructors, java.net.URI will perform encoding for you.

example:
uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon doe", null);

uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon%20doe

This is correct (the space is encoded to %20).

The trouble comes with certain characters that RFC 2396 (the URI spec) designates as "reserved". "Reserved" characters are those that help give URIs their structure:

reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
           "$" | ","

Those characters are also allowed to be used in a non-reserved fashion - for example as values within a query string. In such cases, you are required to URL encode them, effectively "escaping" them.

And it seems that the multi-arg constructors, which do URL encoding for you, do NOT provide a way for you to encode these characters - which means you can only ever use them for their reserved (unescaped) purpose.

For example, suppose I want to produce this URL:

http://foo.com/bar?a=b&c=jon%26doe

// %26 is the encoded value of &
// %25 is the encoded value of %

uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon%26doe", null);
uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon%2526doe

// java.net.URI encodes the incoming "%" as %25

uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon&doe", null);
uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon&doe

// java.net.URI has no way of knowing that the un-escaped "&" is
// actually a value in the URI
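
For anyone who wants to reproduce this, here is a small self-contained sketch of the three cases above. The class and method names (MultiArgUriDemo, build) are just made up for illustration; only the java.net.URI seven-argument constructor and toASCIIString() are real API.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class, not part of http client.
final class MultiArgUriDemo {

    // Builds a URI with the multi-arg constructor, passing the given
    // string as the query component, and returns the ASCII form.
    static String build(String query) {
        try {
            return new URI("http", null, "foo.com", -1, "/bar", query, null)
                    .toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Space is illegal in a query, so it gets quoted as %20.
        System.out.println(build("a=b&c=jon doe"));
        // "%" is always quoted by the multi-arg constructors, so %26 becomes %2526.
        System.out.println(build("a=b&c=jon%26doe"));
        // "&" is legal in a query, so it is passed through untouched.
        System.out.println(build("a=b&c=jon&doe"));
    }
}
```

Neither of the last two calls gives you c=jon%26doe, which is the point of this mail.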

The upshot of all of this is that I claim the multi-arg constructors are unusable unless you restrict your URLs to never use reserved characters as values. In our use case, we can't do that because we don't control what URLs are incoming / outgoing.

(Note that I can produce the desired URIs if I use the single-arg constructor and do all of the encoding myself beforehand.)
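
A minimal sketch of that workaround, assuming the values are encoded with java.net.URLEncoder as we did on the old stack. The class and method names here are hypothetical. Note that URLEncoder does form-style encoding, so a space comes out as "+" rather than %20; that may or may not be what you want.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demo class, not part of http client.
final class PreEncodedUriDemo {

    // Escapes a single query value so reserved characters such as "&"
    // are carried as data rather than as URI structure.
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Assembles the full string by hand and uses the single-arg
    // constructor, which treats its input as already encoded.
    static URI build(String value) {
        try {
            return new URI("http://foo.com/bar?a=b&c=" + encodeValue(value));
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("jon&doe").toASCIIString());
    }
}
```

Here the %26 survives, because the single-arg constructor does not re-quote the "%".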

This ends up being a problem on http client 4.0, because the URI passed in is reconstructed a few times under the covers by http client - using the multi-arg constructors. I believe that the multi-arg constructors have to be replaced with single-arg constructors.
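
To illustrate why rebuilding is lossy, here is a sketch of what a rebuild through the multi-arg constructor does to a correctly encoded URI. This is NOT the actual http client code, just my assumption of the general pattern; the class and method names are made up.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; it only mimics the pattern of taking a URI
// apart and reassembling it with the multi-arg constructor.
final class RebuildUriDemo {

    // Rebuilds from the decoded query (getQuery): the escaped "&" turns
    // back into a plain "&" and becomes a parameter separator.
    static String rebuildDecoded(URI u) {
        try {
            return new URI(u.getScheme(), null, u.getHost(), u.getPort(),
                    u.getPath(), u.getQuery(), u.getFragment()).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Rebuilds from the raw query (getRawQuery): the "%" is quoted again,
    // so %26 turns into %2526.
    static String rebuildRaw(URI u) {
        try {
            return new URI(u.getScheme(), null, u.getHost(), u.getPort(),
                    u.getPath(), u.getRawQuery(), u.getFragment()).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        URI original = URI.create("http://foo.com/bar?a=b&c=jon%26doe");
        System.out.println(rebuildDecoded(original));
        System.out.println(rebuildRaw(original));
    }
}
```

Either way, the URI that comes out is not the one that went in.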

-Tim Julien




