2016-11-18 19:02 GMT+03:00 Christopher Schultz <[email protected]>:
> André,
>
> On 11/18/16 3:50 AM, André Warnier (tomcat) wrote:
>> On 18.11.2016 05:56, Christopher Schultz wrote:
>>> Since UTF-8 is supposed to be the "official" character encoding,
>>
>> Now where is that specified ? As far as I know, the default
>> charset for everything HTTP and HTML-wise is still iso-8859-1, no ?
>> (and unfortunately so).
>
> I apologize for the sloppy language: this particular vendor's service
> claims that UTF-8 is the standard *for their service*. Not for HTTP in
> general.
>
>>> The vendor has responded with (paraphrasing) "it seems we don't
>>> completely follow this standard; we're considering what to do
>>> next, which may include no change". This is a big vendor with
>>> *lots* of software clients, so maintaining backward compatibility
>>> is going to be a big deal for them. I've got some tricks up my
>>> sleeve if they decide not to change anything. Hooray for specs.
>>> :(
>>
>> What I never understood in all that, is why browsers and other
>> clients never seem to respect (and servers do not seem to enforce)
>> what is indicated here :
>>
>> https://www.ietf.org/rfc/rfc2388.txt 4.5 Charset of text in form
>> data
>>
>> This would be a simple way to get rid of umpteen character
>> set/encoding issues encountered when trying to interpret <form>
>> data POSTed to web applications.
>
> The problem is that application/x-www-form-urlencoded doesn't give a
> client a natural way to specify the character encoding, and a/xwfu can
> be used inside of a multipart/form-data package as well. You've just
> moved the problem from the Content-Type of the request to the
> Content-Type of the *part* of the multi-part request. Nothing has been
> solved by using multipart/form-data.
>
> And browsers certainly DO use that, but almost exclusively for things
> like file-upload, since files tend to be very big already, and
> urlencoding a bunch of binary bytes makes the file size increase quite
> a bit.
>
>> It seems to me contrary to common sense that in our day and age,
>> the rules for this could not be set once and for all to something
>> like :
>>
>> 1) the default character set/encoding of HTTP and HTML is
>> Unicode/UTF-8 (instead of the current really archaic iso-8859-1)
>> 2) URLs (including query-strings) should be by default interpreted as
>> Unicode/UTF-8, encoded as per
>> https://tools.ietf.org/html/rfc3986#section-2
>> 3) for POST requests :
>> - for the Content-type "application/x-www-form-urlencoded",
>> there SHOULD be a charset attribute indicating the charset and
>> encoding. By default, this is "text/plain; charset=UTF-8"
>
> Don't forget, charset == encoding. The text/plain is the MIME type,
> and that's already been defined as application/x-www-form-urlencoded.
> Somewhere it should just explicitly say "a/xwfu" must contain only
> ASCII bytes, and always encodes a text blob in UTF-8 encoding.
>
> But it will never happen (see below).
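
To make the point above concrete, here is a minimal Python sketch (not
Tomcat code; the helper names charset_of and decode_form are made up
for illustration). It shows that an a/xwfu body is plain ASCII whatever
the charset, that the same text yields different percent-escapes under
UTF-8 and ISO-8859-1, and that a receiver can only decode it correctly
if it knows which charset was used. Falling back to UTF-8 when no
charset parameter is present is an assumption that follows the
proposal quoted above, not something any spec guarantees.

```python
from email.message import Message
from urllib.parse import parse_qs, urlencode


def charset_of(content_type: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type value, or `default`.

    The UTF-8 default is an assumption taken from the proposal in this thread.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_form(body: bytes, content_type: str) -> dict:
    """Decode an application/x-www-form-urlencoded body (hypothetical helper)."""
    # The body itself must be pure ASCII; the charset only governs how the
    # percent-escaped bytes are turned back into characters.
    text = body.decode("ascii")
    return parse_qs(text, encoding=charset_of(content_type))


# The same field value, percent-encoded under two different charsets.
utf8_body = urlencode({"name": "André"}, encoding="utf-8").encode("ascii")
latin1_body = urlencode({"name": "André"}, encoding="iso-8859-1").encode("ascii")

if __name__ == "__main__":
    print(utf8_body)
    print(latin1_body)
    print(decode_form(latin1_body,
                      "application/x-www-form-urlencoded; charset=ISO-8859-1"))
    # Decoding the UTF-8 body with the wrong charset gives mojibake.
    print(parse_qs(utf8_body.decode("ascii"), encoding="iso-8859-1"))
```

Both bodies are 7-bit clean, so nothing in the bytes themselves tells
the receiver which charset applies; that information has to come from
somewhere else, which is exactly the gap being discussed.
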

One more authority, that I forgot to mention in my mail: the IANA
registry of MIME types.

Registry:
https://www.iana.org/assignments/media-types/media-types.xhtml

Registration entry for "application/x-www-form-urlencoded":
https://www.iana.org/assignments/media-types/application/x-www-form-urlencoded

-> Encoding considerations : 7bit

According to the RFC defining this registry, it means that the data is
7-bit ASCII only.
https://tools.ietf.org/html/rfc6838#section-4.8

-> Required parameters : No parameters
-> Optional parameters : No parameters

OK. So no charset= parameter is allowed. My advice to specify the
charset parameter was wrong.

Though historically, ~10 years ago, I saw
"application/x-www-form-urlencoded;charset=UTF-8" Content-Type in the
wild. It was a web site authored in WML (Wireless Markup Language) and
accessed via WAP protocol by mobile phones.

(Specification reference for this WML/WAP usage:
http://technical.openmobilealliance.org/Technical/release_program/docs/Browsing/V2_3-20070227-C/WAP-191-WML-20000219-a.pdf

Document title:
WAP WML
WAP-191-WML
19 February 2000
Wireless Application Protocol
Wireless Markup Language Specification
Version 1.3

-> Page 30 of 110 (in Section "9.5.1 The Go Element"):
There is a table, where the following line is relevant:

Method: post
Enctype: application/x-www-form-urlencoded
Process: [...] The Content-Type header must include the charset
parameter to indicate the character encoding.

I suspect that the above URL is not the official location of the
document. I found it through Googling. Official location should be
http://www.wapforum.org/what/technical.htm
)

Apache Tomcat supports the use of charset parameter with Content-Type
application/x-www-form-urlencoded in POST requests.

>> - for the Content-type "multipart/form-data", each "part" MUST have
>> a Content-type header. If this Content-type is a "text" type, then
>> the Content-type header SHOULD contain a charset attribute. If
>> omitted, by default this is "charset=UTF-8".
>>
>> and be done with it once and for all.
>
> Right: once and for all, for new clients who implement the spec. All
> old clients, servers, proxies, etc. be damned. It's just not
> possible due to the need to be backward-compatible with really weird
> stuff like "smart" toasters and refrigerators, WebTV (remember that?)
> and all manner of embedded devices that will never be updated.
>
> What we really need is a new header that says "here's everything you
> need to know about encoding for this request" and clients and servers
> who both support that header can use it. All other uses need to
> fall-back to this old and nasty heuristic.
>
> -chris

Best regards,
Konstantin Kolinko
