From RFC2396:

   For original character sequences that contain non-ASCII characters,
   however, the situation is more difficult. Internet protocols that
   transmit octet sequences intended to represent character sequences
   are expected to provide some way of identifying the charset used, if
   there might be more than one [RFC2277].  However, there is currently
   no provision within the generic URI syntax to accomplish this
   identification. An individual URI scheme may require a single
   charset, define a default charset, or provide a way to indicate the
   charset used.

   It is expected that a systematic treatment of character encoding
   within URI will be developed as a future modification of this

So there's no right answer here. The IETF seems to be moving towards UTF-8 as the international charset, so we may as well use it. However, I have been unable to find a browser that correctly handles anything outside the ISO-8859-1 charset; double-byte characters are a really great way to screw things up.

So in essence: don't put non-ASCII characters in URLs, because there is no official way to support them. We should, however, give it a shot by using UTF-8, since it is "compatible" with ASCII anyway.
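
To make the "compatible" remark concrete, here is a minimal Java sketch. The class and method names are made up for illustration and are not part of HttpClient. It only shows that pure ASCII text yields the same bytes in US-ASCII and UTF-8, while a non-ASCII character such as "ä" does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of HttpClient.
final class Utf8AsciiCompat {

    // True when the text yields identical bytes in US-ASCII and UTF-8.
    static boolean sameBytesInAsciiAndUtf8(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        // Pure ASCII: both charsets agree.
        System.out.println(sameBytesInAsciiAndUtf8("name=value"));
        // "\u00e4" is "ä": US-ASCII cannot represent it, UTF-8 uses two bytes.
        System.out.println(sameBytesInAsciiAndUtf8("name=\u00e4"));
        System.out.println("\u00e4".getBytes(StandardCharsets.UTF_8).length);
    }
}
```

So existing ASCII-only URLs would be unaffected by a switch to UTF-8; only non-ASCII input changes.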


Adrian Sutton.

On Friday, July 11, 2003, at 03:11 AM, Oleg Kalnichevski wrote:

This is one of many 'shady' areas of the HTTP spec. Basically there is
no standard way for the client to communicate to the server what coding
has been used to encode query parameters. I believe some browsers use
'Accept-Charset' or 'Accept-Language' headers to negotiate the locale
settings to be used by the server. But I am not sure if these headers
can be used to determine what character coding can be used to decode
URL-encoded data.

I think we definitely should not be using US-ASCII by default. The
whole point of URL encoding is to escape non-ASCII characters. I suggest
UTF-8 be used by default.
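
A minimal Java sketch of why the missing charset signal matters. It uses the JDK's java.net.URLEncoder and URLDecoder as stand-ins, not HttpClient's own encoding code, and the class and method names are hypothetical. The client encodes with one charset and the server decodes with another, guessed one.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; the JDK codecs stand in for client and server.
final class CharsetMismatchDemo {

    // Encode with one charset (client side), decode with another (server side).
    static String roundTrip(String value, String encodeCharset, String decodeCharset) {
        try {
            String encoded = URLEncoder.encode(value, encodeCharset);
            return URLDecoder.decode(encoded, decodeCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset", e);
        }
    }

    public static void main(String[] args) {
        // Both sides agree on UTF-8: "ä" survives the round trip.
        System.out.println(roundTrip("\u00e4", "UTF-8", "UTF-8"));
        // Server guesses ISO-8859-1: the two UTF-8 bytes come back as two characters.
        System.out.println(roundTrip("\u00e4", "UTF-8", "ISO-8859-1"));
    }
}
```

Whatever default is chosen, it only works if the server decodes with the same charset.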


On Thu, 2003-07-10 at 17:48, Michael Becke wrote:
Hello Martin,

This is a good question, one that I am not positive I know the answer
to.  The HTTP request line (containing the query params) must be
US-ASCII.  That I am sure of.  The catch is that form urlencoding
strings makes them ASCII, regardless of the original charset.  So
HttpMethod.setQueryString(NameValuePair[]) is assuming that the
inputs (query params) are ASCII when really only the output (encoded
params) should be ASCII.

The question is how does one determine, on the client and the server,
what the charset of the query params is? The request charset can be
specified with the Content-Type header, but this is meant to apply to
the request entity, not the headers. I have a feeling that we should
probably be using the content charset anyway. My reasoning here is that
an HTML form can be sent via a GET (query params) or POST (post content).
In both cases the content must be form urlencoded and my feeling is
that it should be done the same for both.

What does everyone else think?


Martin Schnyder wrote:
When I use the GetMethod class to send text with special characters (German
Umlaute "äöü") in the request parameters, the special characters are not
encoded correctly. This happens when I use method
HttpMethodBase.setQueryString(NameValuePair[] params)
to set the query parameters.

I saw that Release 2.0 Beta 2 fixed that with bug fix 20481. Special
characters are now encoded differently but still wrong, as far as I can see.

Method HttpMethodBase.setQueryString(NameValuePair[]) calls
formUrlEncode(params, HttpConstants.HTTP_ELEMENT_CHARSET) to encode the
parameters. The value of HTTP_ELEMENT_CHARSET is US-ASCII. When I change the
charset to HttpConstants.DEFAULT_CONTENT_CHARSET (which is ISO-8859-1), the
German "Umlaute" are encoded correctly. I checked that with the code in CVS
HEAD. Is this a bug, or should only US-ASCII characters really be
supported in a request URI?

Martin Schnyder
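
For reference, a minimal Java sketch of how the choice of charset changes the encoded form of "äöü". It uses the JDK's java.net.URLEncoder as a stand-in and is not HttpClient's formUrlEncode, so the US-ASCII result may differ from what HttpClient itself produces. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class; java.net.URLEncoder stands in for formUrlEncode.
final class QueryEncodingDemo {

    // URL-encode a parameter value using the named charset.
    static String encode(String value, String charsetName) {
        try {
            return URLEncoder.encode(value, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String umlauts = "\u00e4\u00f6\u00fc"; // "äöü"
        // US-ASCII cannot map the umlauts; each becomes '?', i.e. %3F.
        System.out.println(encode(umlauts, "US-ASCII"));   // %3F%3F%3F
        // ISO-8859-1: one byte per character.
        System.out.println(encode(umlauts, "ISO-8859-1")); // %E4%F6%FC
        // UTF-8: two bytes per character.
        System.out.println(encode(umlauts, "UTF-8"));      // %C3%A4%C3%B6%C3%BC
    }
}
```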

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
