This is the way it is supposed to work.  The default form submission
encoding is application/x-www-form-urlencoded (which, you'll notice, is what
got sent in the Content-Type header).  This means that every byte outside
the safe ASCII range gets URL encoded as %HH (where each H is a single hex
digit).  Your single input character, U+201D, was encoded into UTF-8 (the
encoding of the page containing the form), which turned it into three bytes:
E2 80 9D.  These three bytes were then URL encoded and sent to the servlet.
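
To see the round trip on the client side, here is a minimal sketch using
java.net.URLEncoder.  The class and method names are made up for
illustration, and it assumes a JDK recent enough to have StandardCharsets;
it only mimics what the browser does, it is not what the browser runs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Tomcat or the servlet API.
class FormEncodingDemo {

    // Percent-encode a value roughly the way a browser does for a UTF-8 form.
    static String urlEncodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String input = "\u201D"; // the single character typed into "usr"
        byte[] utf8 = input.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);          // 3
        System.out.println(urlEncodeUtf8(input)); // %E2%80%9D
    }
}
```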

You'll also notice that nothing in the POST request sent to the servlet
indicates the character encoding.  There is no reliable way for the servlet
container to convert the three bytes it receives back into characters
because nothing supplies the appropriate encoding.  This is not the fault of
the container; it's a major hole in the HTTP and HTML specifications that
makes any I18N effort a royal pain in the a**.
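
The three characters you are seeing are what you get when each byte is
treated as one ISO 8859-1 character.  The sketch below reproduces that with
java.net.URLDecoder; it assumes the container falls back to ISO 8859-1,
which matches your output but is my assumption, and the class and method
names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; it imitates the observed behaviour, nothing more.
class ContainerDecodeDemo {

    // Undo the %HH escaping, then map every byte to one ISO 8859-1 character.
    static String decodeAsLatin1(String encoded) {
        try {
            return URLDecoder.decode(encoded, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String param = decodeAsLatin1("%E2%80%9D");
        // Prints e2, 80 and 9d: three characters instead of one.
        for (int i = 0; i < param.length(); i++) {
            System.out.println(Integer.toHexString(param.charAt(i)));
        }
    }
}
```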

There are a couple of ways to decode the data, but what I use is something
like this:

   sValue = new String(sOriginal.getBytes("8859_1"), sEncoding);

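where sEncoding is the encoding used by the client (e.g. Shift_JIS, or
UTF-8 in your case).  This works because each character in sOriginal
carries exactly one of the original bytes, so getBytes("8859_1") recovers
those bytes unchanged.  You can't determine sEncoding a priori.  You'll need
to either assume that all data sent to your application is in a given
encoding or pass the correct encoding in a hidden form field, etc.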
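
Here is the same idea as a small self-contained sketch, applied to the three
characters from your example.  ParamRecoder and recode are names I made up,
and it assumes a JDK with StandardCharsets; treat it as an illustration
under those assumptions rather than a drop-in fix.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are for illustration only.
class ParamRecoder {

    // Recover the raw bytes via ISO 8859-1, then decode them with the
    // encoding the client actually used.
    static String recode(String original, Charset clientCharset) {
        byte[] raw = original.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, clientCharset);
    }

    public static void main(String[] args) {
        // What the servlet saw: U+00E2, U+0080, U+009D.
        String sOriginal = "\u00E2\u0080\u009D";
        String sValue = recode(sOriginal, StandardCharsets.UTF_8);
        System.out.println(sValue.length());                        // 1
        System.out.println(Integer.toHexString(sValue.charAt(0)));  // 201d
    }
}
```

Note that this only round-trips cleanly if the parameter really was decoded
byte-for-byte as ISO 8859-1 in the first place.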



> -----Original Message-----
> From: Mike Spreitzer [mailto:[EMAIL PROTECTED]]
> Sent: Monday, February 19, 2001 4:27 PM
> To: [EMAIL PROTECTED]
> Subject: Shouldn't Tomcat 3.2.1 decode the UTF-8 encoding of request
> parameters?
>
>
> Consider a form that is encoded in UTF-8.  Here's how it comes down:
>
> HTTP/1.0 200 OK
> Content-Type: text/html; charset=UTF-8
> Servlet-Engine: Tomcat Web Server/3.2.1 (JSP 1.1; Servlet 2.2;
> Java 1.3.0;
> AIX 4.3 ppc; java.vendor=IBM Corporation)
>
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
>    "http://www.w3.org/TR/html4/DTD/loose.dtd">
> <html>
> ...
> <FORM METHOD=POST ACTION="/servlet/SusrReg">
> ...
> <INPUT NAME="usr" TYPE=text SIZE="20">
> ...
>
> I fill in the "usr" field with a single character, U+201D, and submit.
> Here's how the submission goes up:
>
> POST /servlet/SusrReg HTTP/1.1
> Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
> application/x-comet, application/pdf, */*
> Referer: http://9.2.43.70:8085/servlet/SusrReg
> Accept-Language: en-us
> Content-Type: application/x-www-form-urlencoded
> Accept-Encoding: gzip, deflate
> User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)
> Host: 9.2.43.70:8085
> Content-Length: 165
> Connection: Keep-Alive
> Cookie: JSESSIONID=loj2w5hcz1
>
> usr=%E2%80%9D&B1=Submit
>
> In my servlet, I find the value of the request parameter named "usr" is a
> string of three characters: U+00E2, U+0080, U+009D.  Should I be offended,
> or expect that the servlet should have to decode the UTF-8?  I find the
> servlet spec v2.2 fairly silent on the issue, leading me to expect that
> the servlet container is supposed to handle the full parameter decoding.
>
> Thanks,
> Mike
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, email: [EMAIL PROTECTED]

