On 18.11.2016 17:02, Christopher Schultz wrote:

> André,

> On 11/18/16 3:50 AM, André Warnier (tomcat) wrote:
>> On 18.11.2016 05:56, Christopher Schultz wrote:
>>> Since UTF-8 is supposed to be the "official" character encoding,

>> Now where is that specified? As far as I know, the default
>> charset for everything HTTP- and HTML-wise is still iso-8859-1, no?
>> (And unfortunately so.)

> I apologize for the sloppy language: this particular vendor's service
> claims that UTF-8 is the standard *for their service*. Not for HTTP in
> general.

> The vendor has responded with (paraphrasing) "it seems we don't
> completely follow this standard; we're considering what to do
> next, which may include no change". This is a big vendor with
> *lots* of software clients, so maintaining backward compatibility
> is going to be a big deal for them. I've got some tricks up my
> sleeve if they decide not to change anything. Hooray for specs.
> :(

>> What I never understood in all that is why browsers and other
>> clients never seem to respect (and servers do not seem to enforce)
>> what is indicated here:
>>
>> https://www.ietf.org/rfc/rfc2388.txt (section 4.5, "Charset of text
>> in form data")
>>
>> This would be a simple way to get rid of umpteen character
>> set/encoding issues encountered when trying to interpret <form>
>> data POSTed to web applications.

> The problem is that application/x-www-form-urlencoded doesn't give a
> client a natural way to specify the character encoding,

Yes, it does. In the case of this content-type, the whole list of posted parameters is provided as one big chunk of text, in the body of the request. The content-type "application/x-www-form-urlencoded" implies text, because there is no good way in that format to include any post parameter which is not text. Since it is text, there is no good reason why the (single) Content-type header of the POST could not provide a charset attribute.
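For illustration, here is what such a request could look like on the wire (a hypothetical example; the host, path and parameter values are made up):

    POST /app/form HTTP/1.1
    Host: example.com
    Content-Type: application/x-www-form-urlencoded; charset=UTF-8
    Content-Length: 31

    name=Andr%C3%A9&city=Li%C3%A8ge

Here %C3%A9 and %C3%A8 are the URL-encoded UTF-8 bytes of "é" and "è"; the "; charset=UTF-8" parameter is exactly the piece that clients omit today, leaving the server to guess.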

> and a/xwfu can
> be used inside of a multipart/form-data package as well. You've just
> moved the problem from the Content-Type of the request to the
> Content-Type of the *part* of the multi-part request. Nothing has been
> solved by using multipart/form-data.

I have not changed or moved anything. I have just added the requirement that, if any of these parts is a text-type part, it SHOULD also carry a charset attribute.

This is precisely what browsers do not do, for reasons beyond my comprehension. The parts already have a Content-type. It is just the charset attribute *for the parts which are text* that is missing, despite what the rfc2388 recommendation says.
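As a sketch of what rfc2388 section 4.5 recommends (the boundary and field name below are invented for the example), a text part carrying the attribute would look like:

    Content-Type: multipart/form-data; boundary=AaB03x

    --AaB03x
    Content-Disposition: form-data; name="city"
    Content-Type: text/plain; charset=UTF-8

    Liège
    --AaB03x--

The per-part "charset=UTF-8" is the one attribute missing in practice; with it, the server would not have to guess how to decode the bytes of that part.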


> And browsers certainly DO use that, but almost exclusively for things
> like file-upload, since files tend to be very big already, and
> urlencoding a bunch of binary bytes makes the file size increase quite
> a bit.

>> It seems to me contrary to common sense that, in our day and age,
>> the rules for this could not be set once and for all to something
>> like:
>>
>> 1) the default character set/encoding of HTTP and HTML is
>> Unicode/UTF-8 (instead of the current, really archaic, iso-8859-1)
>>
>> 2) URLs (including query-strings) should by default be interpreted
>> as Unicode/UTF-8, encoded as per
>> https://tools.ietf.org/html/rfc3986#section-2
>>
>> 3) for POST requests:
>> - for the Content-type "application/x-www-form-urlencoded",
>> there SHOULD be a charset attribute indicating the charset and
>> encoding. By default, this is "text/plain; charset=UTF-8"

> Don't forget, charset == encoding. The text/plain is the MIME type,
> and that's already been defined as application/x-www-form-urlencoded.

I made a mistake here. Scratch the "text/plain;" part above. The charset attribute should be added to the existing Content-type header. In other words, the header should be:

    Content-type: application/x-www-form-urlencoded; charset=xxxx

The MIME type "x-www-form-urlencoded" already *implies* that the content is text, URL-encoded.
It just fails to specify which charset/encoding the query string was encoded with, *before* it was URL-encoded.
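On the receiving side this is cheap to honour. A minimal sketch in servlet terms (the helper name is made up; containers generally already consult this parameter via getCharacterEncoding() when it is present, the trouble being that browsers almost never send it):

    import java.io.UnsupportedEncodingException;
    import javax.servlet.http.HttpServletRequest;

    public class FormCharset {
        // Apply the charset declared in the Content-type header, falling
        // back to HTTP's historical default when the client said nothing.
        static void applyDeclaredCharset(HttpServletRequest request)
                throws UnsupportedEncodingException {
            String contentType = request.getContentType();
            String charset = "ISO-8859-1"; // the archaic default deplored above
            if (contentType != null) {
                for (String param : contentType.split(";")) {
                    param = param.trim();
                    if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                        charset = param.substring(8).trim();
                    }
                }
            }
            // Must run before the first getParameter() call, otherwise the
            // container has already decoded the body with its own default.
            request.setCharacterEncoding(charset);
        }
    }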

> Somewhere it should just explicitly say "a/xwfu" must contain only
> ASCII bytes, and always encodes a text blob in UTF-8 encoding.

But it will never happen (see below).

>> - for the Content-type "multipart/form-data", each "part" MUST have
>> a Content-type header. If this Content-type is a "text" type, then
>> the Content-type header SHOULD contain a charset attribute. If
>> omitted, by default this is "charset=UTF-8".
>>
>> and be done with it once and for all.

> Right: once and for all, for new clients who implement the spec. All
> old clients, servers, proxies, etc. be damned. It's just not
> possible, due to the need to be backward-compatible with really weird
> stuff like "smart" toasters and refrigerators, WebTV (remember that?),
> and all manner of embedded devices that will never be updated.
>
> What we really need is a new header that says "here's everything you
> need to know about encoding for this request"

There is no need for a new header. The existing "Content-type" header is perfectly adequate in all cases. It is the fact that it is not being used properly and consistently that is the problem.

The backward-compatibility issue is also not a real one, as you mention 
yourself below.

> and clients and servers
> who both support that header can use it. All other uses need to
> fall back to this old and nasty heuristic.


Indeed. And this would not be the first time, by far, that sloppy client behaviour is penalised by webservers interpreting the rules more strictly.
But it probably falls upon webservers to initiate the movement.
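As a sketch of what "initiating the movement" could look like in servlet terms (illustrative only; the filter name is invented, though Tomcat does ship a similar org.apache.catalina.filters.SetCharacterEncodingFilter): default to UTF-8 whenever the client declared nothing, instead of the legacy iso-8859-1.

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    public class DefaultToUtf8Filter implements Filter {
        // When the client declared no charset at all, assume UTF-8 rather
        // than silently falling back to iso-8859-1.
        @Override
        public void doFilter(ServletRequest request, ServletResponse response,
                             FilterChain chain) throws IOException, ServletException {
            if (request.getCharacterEncoding() == null) {
                request.setCharacterEncoding("UTF-8");
            }
            chain.doFilter(request, response);
        }

        @Override
        public void init(FilterConfig filterConfig) { }

        @Override
        public void destroy() { }
    }

A client that does send a charset attribute keeps full control; only the silent ones get the stricter default.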

