Hi, Mark, and thanks for the quick response. You provided some info I wasn't aware of. Some responses below:

On 1/8/2019 9:57 PM, Mark Thomas wrote:
On 08/01/2019 21:31, Garret Wilson wrote:

<snip/>

But as discussed above, this is completely wrong: the resource character encoding of a request sent in `application/x-www-form-urlencoded` should have absolutely no bearing on how the encoded octets within that resource are decoded.

That is not the correct interpretation of section 3.12 of the Servlet 4.0 specification (note the section numbers do vary between spec versions). Tomcat implements the correct interpretation - i.e. the charset from the request content-type defines how encoded octets are decoded and, if none is specified, ISO-8859-1 is used as the default.


Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat is correctly following the spec, but I would still say the servlet spec is wrong to make any linkage at all between resource encoding and %nn interpretation. In fact, reading the prose, it's not clear to me that the servlet spec is even strongly tying the %nn interpretation to the encoding. It just seems to say that, unless otherwise specified, the %nn interpretation should be ISO-8859-1. And actually that's a step up from the HTML 4.01 spec, which in https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates that they should be interpreted as US-ASCII codes. :(
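
To make the stakes concrete, here is a minimal sketch in plain Java (no servlet container involved) of how the same %nn octets decode under the two charsets in question. The class name `PercentDecodeDemo` and the sample value are made up for illustration; `java.net.URLDecoder` is only a stand-in for whatever decoding a container actually performs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or the servlet API.
final class PercentDecodeDemo {

    // Decodes %nn escapes (and '+' as a space) using the given charset.
    static String decode(String encoded, Charset charset) {
        try {
            return URLDecoder.decode(encoded, charset.name());
        } catch (UnsupportedEncodingException e) {
            // Should not happen: the name comes from an existing Charset.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9, percent-encoded as its two UTF-8 octets.
        String value = "caf%C3%A9";

        // Decoded as UTF-8: the two octets form one character, U+00E9.
        System.out.println(decode(value, StandardCharsets.UTF_8));

        // Decoded as ISO-8859-1: each octet becomes its own character,
        // U+00C3 followed by U+00A9, which is the familiar mojibake.
        System.out.println(decode(value, StandardCharsets.ISO_8859_1));
    }
}
```

The octets on the wire are identical in both cases; only the charset chosen by the receiver differs.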

You indicate that this is all out of date, and I think we're in agreement there. We really, really need to get the next servlet specification to remove this part. In fact the servlet specification should defer to the official `application/x-www-form-urlencoded` specification, which at this point I think is the W3C HTML5 spec, which in turn defers to the WHATWG spec (which clearly says that UTF-8 should be used). What makes all of this more of a mess is that there seems to be no way to work around this from the client side, e.g. by putting something in the HTML to indicate UTF-8, as `application/x-www-form-urlencoded` doesn't support a `charset` parameter.

Anyway if there are any openings on the committee to update the servlet spec, let me know.


...
As of Servlet 4.0 there is a specification-compliant configuration option to change this default to any encoding of your choice. Obviously, UTF-8 is one of the options. You can do this by adding the following to your web.xml:

<request-character-encoding>UTF-8</request-character-encoding>

Oh, that is really good to know, thanks!! Still I say that the request character encoding is orthogonal to the %nn encoding, but, still, it's good to have an implementation-agnostic way to do it.
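
As a rough illustration of the lookup order described in this thread (a charset on the request Content-Type wins, otherwise the configured default applies, otherwise ISO-8859-1), here is a sketch in plain Java. The class `RequestCharsetSketch` and its `resolve` method are hypothetical and written only for this example; this is not Tomcat's code, and a real container's parsing of the header will be more thorough.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper; NOT Tomcat's implementation, only the rule as described.
final class RequestCharsetSketch {

    // contentType: raw Content-Type header value, may be null.
    // configuredDefault: stands in for <request-character-encoding>, may be null.
    static Charset resolve(String contentType, Charset configuredDefault) {
        if (contentType != null) {
            // Look for a "charset=..." parameter after the media type.
            for (String part : contentType.split(";")) {
                String param = part.trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim();
                    // Strip optional surrounding double quotes.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    return Charset.forName(name);
                }
            }
        }
        // No charset on the request: use the configured default if there is one,
        // otherwise the ISO-8859-1 default mentioned in the discussion.
        return configuredDefault != null ? configuredDefault : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        String formType = "application/x-www-form-urlencoded";
        System.out.println(resolve(formType, null));
        System.out.println(resolve(formType, StandardCharsets.UTF_8));
        System.out.println(resolve(formType + "; charset=UTF-8", null));
    }
}
```

Since browsers typically send form posts without a charset parameter, the second case is the one the web.xml setting changes in this sketch.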



Whether Tomcat should ship with this setting present in conf/web.xml by default is something that should probably be discussed for Tomcat 10. Given the current state of the web, there is a reasonable case for doing so. I'll add that to the TOMCAT-NEXT discussion list.


Yes please! If I can help in any way, let me know.



The Tomcat Wiki also needs to be updated to take account of this new configuration option (and the related <response-character-encoding>). Since it is a wiki and this is clearly an issue you care about, would you like to tackle that?


Yes, I'd love to. Let me know what permissions I need, etc.

I have an international flight boarding right now so I have to go, and I may not reply for the next few hours, but definitely sign me up.

Thanks,

Garret


