On 2/6/2020 11:46 AM, André Warnier (tomcat/perl) wrote:
As of Tomcat 10, conf/web.xml contains the following:

<!--
   Set the default request and response character encodings to UTF-8.
-->
<request-character-encoding>UTF-8</request-character-encoding>
<response-character-encoding>UTF-8</response-character-encoding>

That *should* have the effect you are looking for but I confess I
haven't tested it in any great detail.


As I am sure many people (Christopher included) would agree, the real solution would be for browsers and other HTTP clients to indicate clearly in the request the charset/encoding of each text parameter that they are sending.
There are even HTTP headers already defined for that.


Which HTTP headers are you referring to? `Content-Type`? It is my opinion that this is irrelevant and not applicable.

As I explained (extensively) in my original post for this thread back on 2019-01-08, the issue is not the charset of `application/x-www-form-urlencoded`. That media type is made up of ASCII characters. It doesn't matter whether you say it's ASCII, ISO-8859-1, UTF-8, or whatever; the actual characters stay 100% the same. At issue is which charset to use when decoding the octets that have been encoded (as specified by the `application/x-www-form-urlencoded` media type itself). This is independent of the encoding of the media type itself; rather, it is defined by the specification for the format.
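To make the distinction concrete, here is a minimal Java sketch (not Tomcat's actual code). The class name `FormDecodingDemo`, its helper methods, and the sample value `caf%C3%A9` are made up for illustration, and it assumes Java 10 or later for the `URLDecoder.decode(String, Charset)` overload. It shows that the encoded form is the same bytes whichever charset you label it with, while the decoded result depends entirely on the charset chosen for the encoded octets.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FormDecodingDemo {

    /** Decodes a form-encoded value, interpreting the percent-encoded octets in the given charset. */
    public static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    /** True if the encoded text serializes to identical bytes as ASCII, ISO-8859-1, and UTF-8. */
    public static boolean sameBytesOnTheWire(String encoded) {
        byte[] ascii = encoded.getBytes(StandardCharsets.US_ASCII);
        return Arrays.equals(ascii, encoded.getBytes(StandardCharsets.ISO_8859_1))
            && Arrays.equals(ascii, encoded.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Hypothetical sample: "caf" followed by the two UTF-8 octets of U+00E9 (0xC3 0xA9).
        String wire = "caf%C3%A9";

        // The encoded form is plain ASCII, so the charset label on the media type changes nothing.
        System.out.println(sameBytesOnTheWire(wire));

        // Octets read as UTF-8: "caf" + U+00E9 (e with acute accent).
        System.out.println(decode(wire, StandardCharsets.UTF_8));

        // Same octets read as ISO-8859-1: "caf" + U+00C3 U+00A9, the familiar mojibake.
        System.out.println(decode(wire, StandardCharsets.ISO_8859_1));
    }
}
```

The only thing that varies between the last two calls is the charset passed to the decoder, which is exactly the choice the format specification has to make.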

Unfortunately, https://tools.ietf.org/html/rfc1866 actually says we should use ASCII when decoding the octets, but this is severely antiquated and doesn't fit with modern practice. The WHATWG essentially redefines the format to say that the octets must be interpreted as UTF-8:

https://url.spec.whatwg.org/#application/x-www-form-urlencoded
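As a rough sketch of what "always decode as UTF-8" looks like in practice, the following Java snippet parses a form body along the lines of the WHATWG approach: split on `&`, split each piece on the first `=`, then decode name and value as UTF-8. The class name `Utf8FormParser` and the sample body are hypothetical. It leans on `java.net.URLDecoder` (Java 10+ overload), which differs from the WHATWG algorithm in at least one respect: it throws `IllegalArgumentException` on a malformed `%` sequence instead of passing it through. Treat it as an approximation, not a conforming implementation.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class Utf8FormParser {

    /** Parses a form-encoded body into name/value pairs, always decoding octets as UTF-8. */
    public static List<Map.Entry<String, String>> parse(String body) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String piece : body.split("&")) {
            if (piece.isEmpty()) {
                continue; // skip empty sequences such as those produced by "a=1&&b=2"
            }
            // Split on the first '=' only; a piece without '=' is a name with an empty value.
            int eq = piece.indexOf('=');
            String name = eq < 0 ? piece : piece.substring(0, eq);
            String value = eq < 0 ? "" : piece.substring(eq + 1);
            // URLDecoder turns '+' into a space and decodes %HH octets in the given charset.
            pairs.add(Map.entry(
                URLDecoder.decode(name, StandardCharsets.UTF_8),
                URLDecoder.decode(value, StandardCharsets.UTF_8)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hypothetical body; no Content-Type charset is consulted anywhere.
        for (Map.Entry<String, String> pair : parse("name=Andr%C3%A9&city=K%C3%B6ln+Nord")) {
            System.out.println(pair.getKey() + " -> " + pair.getValue());
        }
    }
}
```

Note that the parser never takes a charset parameter: under this reading of the format there is nothing to configure.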

So to summarize my view:

 * The decoding of the octets encoded in the
   `application/x-www-form-urlencoded` media type is completely
   independent of the charset indicated in the `Content-Type` header;
   it is governed by the specification of the format itself.
 * RFC 1866 is severely out of date and out of step, and the WHATWG's
   specification of the `application/x-www-form-urlencoded` media type
   should be used instead. (Modern browser practice would seem to agree
   with me.)
 * Therefore `web.xml` settings, HTTP headers, etc. are all irrelevant,
   as this is an issue dealing with the file format itself, and the
   latest spec for the file format says to use UTF-8, so everyone
   should use UTF-8 already.

The new default `web.xml` in Tomcat 10 is a wonderful step in the right direction.

See my original post for more in-depth explanation.

Garret
