Gregor Schneider wrote:
I found this one:

http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset

Actually, it's not clear to me why Tomcat should assume the input
is encoded in ISO-8859-1, when one can give detailed information
about how the form data is encoded.

If I understand it correctly, one can even *force* any client (as long
as the client is following the specs) to encode the form data using
the "accept-charset" attribute of the <form> element.

IOW:

Setting accept-charset="UTF-8" should solve the problems.

Comments, anyone?

Yes.
But no, it does not seem to work.
I was under the same impression as you, and I already knew about <form accept-charset=...>. But I just tested this in Firefox 2 and in IE 6, and it does not work as expected.

This is my test :

1) I created a html page as follows :
-- begin --
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<form name="f1" action="http://mira.wissensbank.com/pcgi/printenv.pl" method="POST"
 enctype="multipart/form-data" accept-charset="UTF-8">
 First param: <input name="param1" type="text" value="andré"><br/>
 Second param: <input name="param2" type="text" value="gregör"><br/>
 <input name="go" type="submit" value="GO"><br/>
</form>
</body>
</html>
-- end --

The above file was created with a UTF-8 aware editor, and the non-ASCII characters in it (the "é" in "andré" and the "ö" in "gregör"; the umlaut is mine, as a test) are encoded as UTF-8. I saved the file as UTF-8 without a BOM. As you can see, the document contains a <meta> tag indicating the page encoding, and the form carries an "accept-charset" attribute naming the same encoding.

2) I opened this file in Firefox 2.0 and clicked the GO button.
Since I opened this as a local file, there is no "Content-Type" header coming from a server to confuse things. In Firefox, I have the LiveHttpHeaders plugin installed, which allows me to see the request as sent to the server and to save a copy of it, which I did. This is the result :

-- begin --
POST /pcgi/printenv.pl HTTP/1.1
Host: mira.wissensbank.com
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.15) Gecko/20080623 Firefox/2.0.0.15
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-gb,en;q=0.7,de-de;q=0.3
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive
Content-Type: multipart/form-data; boundary=---------------------------218302158314236
Content-Length: 350
-----------------------------218302158314236
Content-Disposition: form-data; name="param1"

andré
-----------------------------218302158314236
Content-Disposition: form-data; name="param2"

gregör
-----------------------------218302158314236
Content-Disposition: form-data; name="go"

GO
-----------------------------218302158314236--
-- end --

3) I did the same in Internet Explorer 6.0, which has another plugin of similar functionality (Fiddler), with which I can capture the whole request.
Here it is :
-- begin --
POST /pcgi/printenv.pl HTTP/1.1
Accept: */*
Accept-Language: de
Content-Type: multipart/form-data; boundary=---------------------------7d98c5bb072c
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30)
Host: mira.wissensbank.com
Content-Length: 338
Connection: Keep-Alive
Pragma: no-cache

-----------------------------7d98c5bb072c
Content-Disposition: form-data; name="param1"

andré
-----------------------------7d98c5bb072c
Content-Disposition: form-data; name="param2"

gregör
-----------------------------7d98c5bb072c
Content-Disposition: form-data; name="go"

GO
-----------------------------7d98c5bb072c--
-- end --

So, as anyone can see, neither one of these browsers is adding any charset information to the POST. Which I personally find very strange, and rather on the bad side of the HTTP specs.
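
To make the consequence concrete, here is a minimal sketch (Python, purely for illustration; it is not what Tomcat or either browser runs). It assumes the browser honoured accept-charset="UTF-8", which the captures above cannot confirm, and shows that the same unlabelled bytes turn into different text depending on which charset the receiver guesses.

```python
# Illustration only: the wire bytes for "andré" IF the browser encoded the
# form data as UTF-8 (an assumption; the captures carry no charset label).
raw = "andré".encode("utf-8")

# Without a charset label the receiver has to guess. Two plausible guesses:
as_utf8 = raw.decode("utf-8")          # the guess that matches the sender
as_latin1 = raw.decode("iso-8859-1")   # the default discussed in this thread

print(len(raw), repr(as_utf8), repr(as_latin1))
```

Decoded as ISO-8859-1, the two bytes of the UTF-8 "é" become two separate characters ("Ã©"), which is the classic symptom of this problem.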

Which tends to confirm the note in SRV 4.9 of the Servlet Specs 2.4/2.5 :
"Currently, many browsers do not send a char encoding qualifier with the Content-Type header, leaving open the determination of the character encoding for reading HTTP requests."

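The situation that note describes can be sketched as follows (Python, for illustration only; the helper name "request_charset" and the second header value are hypothetical, while the first header value is copied from the Firefox capture above). The receiver looks for a charset parameter on the request's Content-Type header and, finding none, has nothing left but a default.

```python
from email.message import Message

def request_charset(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)

# Header value as sent by Firefox 2 in the capture above: no charset parameter.
firefox_header = "multipart/form-data; boundary=---------------------------218302158314236"
# Hypothetical header showing what a labelled request could look like.
labelled_header = "application/x-www-form-urlencoded; charset=UTF-8"

print(request_charset(firefox_header))                 # nothing to go on
print(request_charset(firefox_header, "iso-8859-1"))   # receiver falls back to a default
print(request_charset(labelled_header))                # charset stated by the sender
```
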
Which also seems to contradict the HTML specs which you mention :
http://www.w3.org/TR/html401/interact/forms.html#h-17.13
and following paragraphs. (Note by the way the "Note" at the end of 17.13.1)
In particular, this one from section "17.13.4 Form content types" :
As with all multipart MIME types, each part has an optional "Content-Type" header that defaults to "text/plain". User agents should supply the "Content-Type" header, accompanied by a "charset" parameter.
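
As a rough sketch of what that paragraph asks for (Python, illustration only; the boundary, the helper "part" and the labelled first part are made up for this example and do not come from either capture), the body below has one part that carries its own Content-Type with a charset and one that does not. A generic MIME parser can decode the first one reliably and can only hand back raw bytes for the second.

```python
from email.parser import BytesParser

BOUNDARY = b"exampleboundary123"  # hypothetical boundary, not from the captures

def part(name, value, charset=None):
    """Build one form-data part; add a per-part Content-Type only if charset is given."""
    head = b'Content-Disposition: form-data; name="' + name.encode("ascii") + b'"\r\n'
    if charset:
        head += b"Content-Type: text/plain; charset=" + charset.encode("ascii") + b"\r\n"
    return b"--" + BOUNDARY + b"\r\n" + head + b"\r\n" + value + b"\r\n"

body = (
    part("param1", "andré".encode("utf-8"), charset="UTF-8")  # labelled part
    + part("param2", "gregör".encode("utf-8"))                # unlabelled part
    + b"--" + BOUNDARY + b"--\r\n"
)

# Prepend the request-level header so the MIME parser knows the boundary.
header = b"Content-Type: multipart/form-data; boundary=" + BOUNDARY + b"\r\n\r\n"
message = BytesParser().parsebytes(header + body)

fields = {}
for p in message.get_payload():
    name = p.get_param("name", header="content-disposition")
    charset = p.get_content_charset()   # None when the part states no charset
    raw = p.get_payload(decode=True)    # the part body as bytes
    text = raw.decode(charset) if charset else None  # decode only when labelled
    fields[name] = (charset, text, raw)
    print(name, charset, text, raw)
```

The unlabelled part is deliberately left undecoded here: without a charset, any decoding step would be a guess.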

Well, Firefox 2.0 and IE 6.0 don't supply a per-part "Content-Type" header, let alone a charset. In the case of IE 6.0, I am not really surprised, but in the case of Firefox, who would have thunk ?


Anyway, it puts a different spin on what I posted here before: even with an HTML form that should leave the browser no choice, the servlet engine still gets no information about the actual character set of the data sent by the browser.

Which personally, in our day and age, I find absolutely terrible.

I will now try to re-test this with Firefox 3 and IE 7.

Update : just tested with Firefox 3.1 beta; it does not send a per-part Content-Type or a charset either.
I am puzzled as to why.

