Shanti Suresh wrote:
Greetings,


On Wed, Jun 26, 2013 at 4:08 PM, Christopher Schultz wrote:

André,



But, even when sending UTF-8 encoded data according to this
principle, they are *not* indicating that it is UTF-8 data, which
is basically wrong, because the standard HTTP/HTML character set is
iso-8859-1, and they *should* indicate it when that is not what
they are sending.  But that is the reality.
No, as much as it pains me to do so, I agree with the Mozilla folks
on this one: adding a charset attribute to an
application/x-www-form-urlencoded Content-Type violates the spec. There is
no good solution.
...



We really need an RFC for HTTP 2.0, with UTF-8 as the default
charset/encoding.
+1

Maybe they can clear up Tomcat logging configuration while they are at
it :)


Thank you!  This discussion was quite informational.


You are welcome.

Further, and relatively [OT]: in some other (non-Tomcat, non-Java) applications, we solve the general issue as follows (taking into account the deficiencies of the RFCs, the servers, the browsers, and the users):
- when programmers create the HTML documents containing the forms, they must make sure that they use a tool which really saves the HTML document in the charset/encoding they intend
- the html page should contain a declaration like :
<meta http-equiv="Content-Type" content="text/html; charset=xxxxx" />
(where xxxxx is the correct charset/encoding, like "UTF-8")
- each form that is sent to the browser is sent by the server with an explicit HTTP header : Content-Type: text/html; charset=xxxxx
(that normally happens automatically, but you should nevertheless check that it matches)
- the <form> tag of the form should contain the "accept-charset" attribute with the expected character set as value, like
<form accept-charset="UTF-8" ...>
- the form itself contains a hidden parameter like :
<input type="hidden" name="charset-test" value="yyyyy">
(where yyyyy is a character sequence whose length in bytes differs depending on the character set actually used. E.g., for Europe, "ÖöÜüÄä")
- the application which receives the form submit data must first check whether the string received for the "charset-test" parameter matches its expectations. In other words, if the application expects UTF-8, then it should check that the received string has a byte length of 12 and a character length of 6, and matches the Unicode string "ÖöÜüÄä". If it doesn't, then it should take appropriate action (abort the action, or try another charset).
(if the form sent by the server contains additional data coming from a back-end database system, then one should of course also check that the charset of that data matches the one of the form)
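
As a rough illustration of the "charset-test" check described above, here is a minimal sketch in Python (chosen only for brevity; the applications mentioned are not written in it). The function names check_charset_test and detect_charset are hypothetical, and the sketch assumes the application can get at the still percent-encoded value of the parameter, before any framework has decoded it with a charset of its own choosing.

```python
from urllib.parse import unquote_to_bytes

# The probe string placed in the hidden "charset-test" field.
EXPECTED = "ÖöÜüÄä"


def check_charset_test(raw_value, expected_charset="utf-8"):
    """Return True if the percent-encoded value decodes to EXPECTED.

    raw_value is assumed to be the raw, still percent-encoded value
    of the "charset-test" parameter as it arrived in the request.
    """
    # Form encoding uses "+" for spaces; the probe has none, but handle it anyway.
    data = unquote_to_bytes(raw_value.replace("+", " "))
    try:
        text = data.decode(expected_charset)
    except UnicodeDecodeError:
        return False
    # For UTF-8 this means 12 bytes and 6 characters.
    expected_bytes = EXPECTED.encode(expected_charset)
    return text == EXPECTED and len(data) == len(expected_bytes)


def detect_charset(raw_value, candidates=("utf-8", "iso-8859-1")):
    """Try each candidate charset in order; return the first that matches, else None."""
    for charset in candidates:
        if check_charset_test(raw_value, charset):
            return charset
    return None
```

If detect_charset returns None, the application would abort the action; if it returns a charset other than the expected one, it could either reject the submission or decode the remaining parameters with the charset that matched.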

This may look a bit like overkill, but it is the result of long and painful real-world experience with multi-lingual applications used with multiple browsers and multiple types of users in multiple countries doing cut-and-paste of all kinds of stuff into forms.




