On Tue, 2016-12-27 at 10:42 -0500, Jaime Hablutzel Egoavil wrote:
> From RFC 3986:
>
> > When a new URI scheme defines a component that represents textual
> > data consisting of characters from the Universal Character Set [UCS],
> > the data should first be encoded as octets according to the UTF-8
> > character encoding [STD63]; then only those octets that do not
> > correspond to characters in the unreserved set *should be*
> > percent-encoded. For example, the character A would be represented
> > as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be
> > represented as "%C3%80", and the character KATAKANA LETTER A would
> > be represented as "%E3%82%A2".
>
> As you can see it says "should", so it seems to me that it is not an
> obligation to percent-encode non-ASCII.
>
> A real example where this problem arises is Firefox invoking custom
> URI handlers. For example, if you have something like this in an HTML page:
>
>     <a href="myuri:?foo=b*%C3%A1*r">Invoke myuri handler</a>
>
> The URI handler application will receive
>
>     myuri:?foo=bár
>
> Then, during query component parsing, HttpClient will fail to parse that
> parameter value.
>
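For reference, the encoding rule quoted from RFC 3986 above can be sketched with the JDK alone. This is only an illustration, not HttpClient code: the class name `PercentEncodeSketch` and the method `percentEncode` are made up for this example, and the unreserved set is taken as ALPHA / DIGIT / "-" / "." / "_" / "~" per RFC 3986.

```java
import java.nio.charset.StandardCharsets;

public class PercentEncodeSketch {

    // Unreserved characters per RFC 3986, section 2.3.
    private static final String UNRESERVED =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
          + "abcdefghijklmnopqrstuvwxyz"
          + "0123456789-._~";

    // Hypothetical helper: encode the text as UTF-8 octets first, then
    // percent-encode every octet that is not an unreserved character.
    public static String percentEncode(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            int octet = b & 0xFF;
            if (octet < 0x80 && UNRESERVED.indexOf((char) octet) >= 0) {
                out.append((char) octet);
            } else {
                out.append('%');
                out.append(Character.toUpperCase(Character.forDigit(octet >> 4, 16)));
                out.append(Character.toUpperCase(Character.forDigit(octet & 0x0F, 16)));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The three characters used as examples in the RFC text.
        System.out.println(percentEncode("A"));
        System.out.println(percentEncode("\u00C0")); // LATIN CAPITAL LETTER A WITH GRAVE
        System.out.println(percentEncode("\u30A2")); // KATAKANA LETTER A
        // The parameter value from the myuri example.
        System.out.println("foo=" + percentEncode("b\u00E1r"));
    }
}
```

A handler that applied something like this to the value it receives before handing the query to a parser would pass ASCII-only input, which is what the percent-encoded `href` contained in the first place.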
Both HTTP/1.1 and HTTP/2 require message head elements, including the
request URI, to be ASCII only.

Oleg

> On Tue, Dec 27, 2016 at 10:11 AM, Oleg Kalnichevski <[email protected]>
> wrote:
>
> > On Sat, 2016-12-24 at 18:26 -0500, Jaime Hablutzel Egoavil wrote:
> > > Currently something like this:
> > >
> > > public class ProblemWithNonAscii {
> > >     public static void main(String[] args) {
> > >         List<NameValuePair> pairs = URLEncodedUtils.parse("foo=bár",
> > >                 StandardCharsets.UTF_8);
> > >         System.out.println(pairs);
> > >     }
> > > }
> > >
> > > produces this output:
> > >
> > > [foo=b�r]
> > >
> > > Where the 'á' character has been scrambled.
> > >
> > > I can see that this is related to the following narrowing primitive
> > > conversion:
> > > https://github.com/apache/httpclient/blob/4.5.2/httpclient/src/main/java/org/apache/http/client/utils/URLEncodedUtils.java#L570
> > >
> > > This is a bug, isn't it?
> >
> > Jaime,
> >
> > URL encoded content is not supposed to have non-ASCII characters in the
> > first place, is it not?
> >
> > Oleg
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
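The scrambled output in the original report can be reproduced with the JDK alone, which may help show where the 'á' is lost. The sketch below assumes that the linked line narrows each unescaped `char` to a `byte` before the buffer is decoded with the requested charset; that is a reading of the report, not a copy of the HttpClient source. The class name `NarrowingSketch` and the method `decodeAfterNarrowing` are invented for this example.

```java
import java.nio.charset.StandardCharsets;

public class NarrowingSketch {

    // Hypothetical helper: narrow every char to a single byte (as the
    // report describes), then decode the resulting bytes as UTF-8.
    public static String decodeAfterNarrowing(String raw) {
        byte[] narrowed = new byte[raw.length()];
        for (int i = 0; i < raw.length(); i++) {
            // Narrowing primitive conversion: only the low 8 bits survive.
            narrowed[i] = (byte) raw.charAt(i);
        }
        // Malformed UTF-8 input is replaced with U+FFFD by this constructor.
        return new String(narrowed, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String result = decodeAfterNarrowing("b\u00E1r");
        // Print the code points so the result does not depend on the console encoding.
        for (int i = 0; i < result.length(); i++) {
            System.out.printf("U+%04X%n", (int) result.charAt(i));
        }
    }
}
```

The 'á' (U+00E1) narrows to the single byte 0xE1, which on its own is not a complete UTF-8 sequence, so the decoder substitutes the replacement character. Pure ASCII input is unaffected, which is consistent with the parser working as long as non-ASCII values arrive percent-encoded.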
