Joseph Millet wrote:
Maybe I'm missing something but from the little knowledge I have, I'd
think an HTML form is posted encoded in the form enclosing HTML
document charset specified in the sent Server headers. So that you
settle a page encoded in iso-8859-2, you wouldn't expect a form
present in that page to post unicode data, would you ?

Maybe we need to restate the issue a bit differently.
Imagine a website on which there is a starting page with 3 links :
- formA.html
- formB.html
- formC.html
Each of these is an HTML page containing a tag '<form method="POST">'.
Now 3 users, each at their own workstation, obtain this starting page from the server. userA then clicks on the link to formA.html and obtains the corresponding page.
Similarly, userB clicks on the second link, and so on.
The users fill in their respective forms and submit them to the server (in any order).

The process on the server which handles the first submission (whether it is a servlet in Tomcat or a cgi-bin under httpd does not matter) has no idea where this submitted data comes from, right? (It could even come from a page obtained from another server.) So the process in question has to evaluate this data based only on what it gets in this specific POST.

What we are discussing here is how, based only on the data coming in from the browser POST, the server process determines the correct character encoding of what it receives. And the answer so far is that it basically cannot be sure, because the browser does not send enough information with the POST to allow the server process to determine this unambiguously.
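
To make the ambiguity concrete, here is a minimal Python sketch (not Tomcat code; the field name "name" and the sample character are just assumptions for illustration). It builds the body that a browser would post from a page encoded in iso-8859-2, and then decodes the same bytes under several charsets, as a server process would have to do if it were guessing:

```python
from urllib.parse import quote, unquote

# What the user typed into a form on a page served as iso-8859-2.
typed = "ő"

# The urlencoded POST body: the character becomes one percent-escaped byte.
body = "name=" + quote(typed, encoding="iso-8859-2")

# The server only sees the escaped byte; nothing in the body names a charset.
raw_value = body.split("=", 1)[1]

as_latin2 = unquote(raw_value, encoding="iso-8859-2")  # what the user meant
as_latin1 = unquote(raw_value, encoding="iso-8859-1")  # a different, valid letter
as_utf8 = unquote(raw_value, encoding="utf-8")         # invalid, replaced by U+FFFD

print(body, as_latin2, as_latin1, as_utf8)
```

All three decodings "work" in the sense that none of them raises an error, and two of them produce a perfectly plausible letter. Only knowledge from outside the POST tells the server which one is right.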

Of course, if the server process is sure that the form originally came from itself, and that all the forms composing this application are defined such that the browser *should* always encode the data in a specific way, then the process could reasonably assume a charset and encoding. But if one of the users uses a non-compliant browser that does not care a jot about what the HTML is telling it to do, then that assumption no longer holds.
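
As a rough sketch of that "assume a charset" approach, again in Python rather than servlet code: the application assumes all of its forms post UTF-8 (ASSUMED_CHARSET and decode_field are hypothetical names of mine, and the iso-8859-1 fallback is just one possible choice). A strict decode at least lets the process notice when a browser clearly did not follow the assumption:

```python
from urllib.parse import unquote_to_bytes

# Assumption for this sketch: every form of the application is served so that
# a compliant browser posts UTF-8.
ASSUMED_CHARSET = "utf-8"


def decode_field(raw_value):
    """Decode one urlencoded field value.

    Returns (text, matched), where matched is False when the bytes were not
    valid in the assumed charset and a fallback decoding was used instead.
    """
    data = unquote_to_bytes(raw_value.replace("+", " "))
    try:
        return data.decode(ASSUMED_CHARSET), True
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a character, so this never fails,
        # but the result is only a guess.
        return data.decode("iso-8859-1"), False


print(decode_field("%C5%91"))  # bytes that are valid UTF-8
print(decode_field("%F5"))     # a lone byte that cannot be UTF-8
```

Note that this check only helps because UTF-8 has an internal structure that most other byte sequences violate. If the assumed charset were a single-byte one such as iso-8859-2, every byte sequence would decode without error and the process would have no signal at all.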

A separate but connected question: it seems that current browsers do not entirely follow the HTML specifications and, even for multipart/form-data submissions, do not send the charset/encoding headers that would enable the server to know for sure, although they should.
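
For illustration, this is roughly what the difference looks like from the server side. The sketch below uses Python's standard email header parser on the headers of a single multipart/form-data part; both sets of part headers are made-up examples (field name "city" included), not captured browser output:

```python
from email.parser import HeaderParser

# Hypothetical part headers as a fully informative browser would send them.
with_charset = (
    'Content-Disposition: form-data; name="city"\r\n'
    "Content-Type: text/plain; charset=iso-8859-2\r\n"
    "\r\n"
)

# Hypothetical part headers with no Content-Type for the field at all.
without_charset = 'Content-Disposition: form-data; name="city"\r\n\r\n'


def part_charset(header_block):
    """Return the charset declared in a part's headers, or None if absent."""
    return HeaderParser().parsestr(header_block).get_content_charset()


print(part_charset(with_charset))
print(part_charset(without_charset))
```

In the first case the server process can decode the part body with certainty. In the second case it is back to assuming or guessing.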

To go back to your note above:
It is true that the browser, in the absence of other information, SHOULD encode the data it is going to submit in the encoding of the page containing the <form>. This /can/ be changed by using the "accept-charset" attribute of the <form> tag. However, even if the browser follows the specifications in that respect and does encode the data properly, it does not change what I mention above: the server is still really in the dark about what it gets.
