unable to encode reserved characters using java.net.URI multi-arg constructors

Tim Julien Fri, 18 Jan 2008 06:49:16 -0800

All,

I've spent a few days looking into some strange URL encoding issues on
http client 4.0 alpha 2. I'll describe some things I've found; hopefully
I am thinking about this correctly.


I think there is a regression from 3.0 -> 4.0 due to the use of java.net.URI.

On the old commons http client stack, we encoded URLs using
java.net.URLEncoder, and passed them to the
org.apache.commons.httpclient.URI() constructors.  Those constructors
had a boolean parameter that indicated whether the url was encoded.

On the new 4.0 stack, java.net.URI is used instead - and apparently it
has some strange encoding behavior.  For starters, you cannot specify
whether the URL is encoded.  Instead, URIs constructed with the
single-arg constructor are treated as encoded, while URIs constructed
with the multi-arg constructors are treated as un-encoded. When using
the multi-arg constructors, java.net.URI will perform encoding for you.


example:
uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon doe", null);

uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon%20doe

This is correct (the space is encoded to %20).
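
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the two constructor styles side by side. The class and method names are made up for illustration; only the java.net.URI calls are real.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; the name is illustrative only.
final class ConstructorStyleDemo {

    // Multi-arg constructor: components are treated as un-encoded,
    // so java.net.URI escapes the space in the query for us.
    static String multiArg() {
        try {
            URI uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon doe", null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    // Single-arg form: the string is treated as already encoded,
    // so the existing %20 is kept as-is. URI.create wraps new URI(String).
    static String singleArg() {
        return URI.create("http://foo.com/bar?a=b&c=jon%20doe").toASCIIString();
    }

    public static void main(String[] args) {
        System.out.println(multiArg());
        System.out.println(singleArg());
    }
}
```

Both calls should yield the same string here, because a space is not a reserved character.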

The trouble comes with certain characters that RFC 2396 designates as
"reserved". "Reserved" characters are those that help give URIs their
structure:


reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                    "$" | ","

Those characters are also allowed to be used in a non-reserved fashion -
for example as values within a query string. In such cases, you are
required to URL encode them, effectively "escaping" them.

And it seems that the multi-arg constructors, which do URL encoding for
you, do NOT provide a way for you to encode these characters - which
means you can only ever use them for their reserved (unescaped) purpose.


For example, suppose I want to produce this URL:

http://foo.com/bar?a=b&c=jon%26doe

// %26 is the encoded value of &
// %25 is the encoded value of %

uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon%26doe", null);
uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon%2526doe

// java.net.URI encodes the incoming "%" as %25

uri = new URI("http", null, "foo.com", -1, "/bar", "a=b&c=jon&doe", null);
uri.toASCIIString() -> http://foo.com/bar?a=b&c=jon&doe

// java.net.URI has no way of knowing that the un-escaped "&" is
// actually a value in the URI
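
The two failing attempts above can be packaged as a small runnable sketch. The class and method names are hypothetical; the point is only that neither query string produces the desired jon%26doe.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; the name is illustrative only.
final class ReservedCharDemo {

    // Builds http://foo.com/bar?<query> with the 7-arg constructor.
    static String build(String query) {
        try {
            return new URI("http", null, "foo.com", -1, "/bar", query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Pre-encoded "&": the "%" itself gets quoted, giving %2526.
        System.out.println(build("a=b&c=jon%26doe"));
        // Raw "&": left untouched, so it reads as a parameter separator.
        System.out.println(build("a=b&c=jon&doe"));
    }
}
```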

The upshot of all of this is that I claim the multi-arg constructors are
unusable, unless you restrict your URLs to never use reserved
characters as values. In our use case, we can't do that because we
don't control what URLs are incoming / outgoing.

(Note that I can produce the desired URIs if I use the single-arg
constructor and do all of the encoding myself beforehand.)
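
A rough sketch of that workaround, assuming the value is encoded with java.net.URLEncoder as we did on the old stack. The class and method names are invented for the example. One caveat: URLEncoder does form encoding, so a space would come out as "+" rather than %20.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

// Hypothetical demo class; the name is illustrative only.
final class PreEncodedDemo {

    // Encode a single query value ourselves ("&" becomes %26).
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Assemble the full string and hand it to the single-arg form,
    // which treats it as already encoded and leaves %26 alone.
    static URI build(String value) {
        return URI.create("http://foo.com/bar?a=b&c=" + encodeValue(value));
    }

    public static void main(String[] args) {
        URI uri = build("jon&doe");
        System.out.println(uri.toASCIIString()); // encoded form
        System.out.println(uri.getQuery());      // decoded form
    }
}
```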

This ends up being a problem on http client 4.0, because the URI passed
in is reconstructed a few times under the covers by http client - using
the multi-arg constructors. I believe that the multi-arg constructors
have to be replaced with single-arg constructors.
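
To illustrate why such a reconstruction is lossy, here is a sketch of what rebuilding a URI from its components might look like. This is not the actual http client code; the class and method names are hypothetical, and it only shows the general pattern.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; NOT taken from http client itself.
final class RebuildDemo {

    // Rebuild from decoded components: %26 has already been decoded
    // to "&" by getQuery(), and the constructor will not re-escape it.
    static String rebuildDecoded(URI u) {
        try {
            return new URI(u.getScheme(), u.getUserInfo(), u.getHost(), u.getPort(),
                    u.getPath(), u.getQuery(), u.getFragment()).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    // Rebuild from the raw query: the "%" gets quoted again (%2526).
    static String rebuildRaw(URI u) {
        try {
            return new URI(u.getScheme(), u.getUserInfo(), u.getHost(), u.getPort(),
                    u.getPath(), u.getRawQuery(), u.getFragment()).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    // Rebuild through the single-arg form: the escaping survives.
    static String rebuildSingleArg(URI u) {
        return URI.create(u.toASCIIString()).toASCIIString();
    }

    public static void main(String[] args) {
        URI original = URI.create("http://foo.com/bar?a=b&c=jon%26doe");
        System.out.println(rebuildDecoded(original));
        System.out.println(rebuildRaw(original));
        System.out.println(rebuildSingleArg(original));
    }
}
```

Only the last variant round-trips the original string; the first two change the meaning of the query.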


-Tim Julien




