lightbulb,

lightbulb432 wrote:
>> POST requests always use the request's "body" encoding, which is
>> specified in the HTTP header (and can be overridden by using 
>> request.setCharacterEncoding). Some broken clients don't provide 
>> the character encoding of the request, which makes things difficult
>> sometimes.
> 
> What determines what's specified in the HTTP header for the value of the
> encoding?

Well... it's a bit of a chicken-and-egg scenario, since the encoding
specified in the header must match the encoding actually used in the
request. So, you could either decide that the header should match the
content or that the content should match the header.

> Is it purely up to the user agent, or can Tomcat provide hints
> based on previous requests how to encode it - or is it something up to the
> end user to set in their browser (in IE, View -> Encoding)?

Typically, the default encoding used by the user-agent will be
locale-specific. For instance, most browsers in the US will use
ISO-8859-1 as the default encoding, or maybe WINDOWS-1252 if you're
unlucky. Ideally, the server should be able to accept all reasonable
encodings. The "Accept-Charset" header sent by the user-agent tells the
server which encodings it is willing to accept in the response,
weighted by preference. For instance, my en_US Mozilla Firefox on
Windows sends this Accept-Charset string to servers:

Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

This indicates that the browser would prefer ISO-8859-1 encoding, will
also accept UTF-8 as a second choice, and will take anything ('*') if
those two are unavailable.
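
Just to illustrate the mechanics, here's a toy standalone parser for
that header (the class name is made up, and this is nowhere near
Tomcat's real implementation):

import java.util.LinkedHashMap;
import java.util.Map;

public class AcceptCharsetDemo {
    public static void main(String[] args) {
        String header = "ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        // Each comma-separated entry is a charset, optionally followed
        // by ";q=" and a preference weight (1.0 when omitted).
        Map<String, Double> prefs = new LinkedHashMap<String, Double>();
        for (String entry : header.split(",")) {
            String[] pieces = entry.trim().split(";\\s*q=");
            double q = (pieces.length > 1)
                     ? Double.parseDouble(pieces[1]) : 1.0;
            prefs.put(pieces[0], q);
        }
        System.out.println(prefs); // {ISO-8859-1=1.0, utf-8=0.7, *=0.7}
    }
}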

On HTML <form> elements, you may override the encoding used to send the
data:

<form accept-charset="UTF-8">

The HTML 4 specification says this about the accept-charset attribute:
"The default value for this attribute is the reserved string "UNKNOWN".
User agents may interpret this value as the character encoding that was
used to transmit the document containing this FORM element."
(http://www.w3.org/TR/html4/interact/forms.html#h-17.3)

So, if the server sends a document using UTF-8, it is "polite" for the
user-agent to use that same encoding to respond to the server if the
server hasn't indicated any preference by using the accept-charset
<form> attribute.

> In what cases would you call request.setCharacterEncoding to override the
> value specified by the user agent?

You should only do this when the user-agent does not declare the
charset being used in the body of the request through the Content-Type
request header, and only when you are relatively confident that the
user-agent actually sent the data in the character set you are forcing.

For instance, if you suspect that most browsers adhere to the W3C's
recommendation above that an UNKNOWN accept-charset implies that the
browser should respond to the server with the same charset as used in
the previous server response (got all that?), and you always use the
same charset to send pages (say, UTF-8), then it is reasonable to
override any unspecified Content-Type encoding with the charset you use
to send pages (UTF-8, in this case).
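
As a concrete sketch, a servlet Filter along these lines is a common
way to do that; the class name is mine, and you would still have to map
it in web.xml:

import java.io.IOException;
import javax.servlet.*;

public class DefaultCharsetFilter implements Filter {
    public void init(FilterConfig config) { }

    public void doFilter(ServletRequest request, ServletResponse response,
                         FilterChain chain)
            throws IOException, ServletException {
        // Respect any charset the client declared; only fill in the
        // gap, and only because we always serve our pages as UTF-8.
        if (request.getCharacterEncoding() == null) {
            request.setCharacterEncoding("UTF-8");
        }
        chain.doFilter(request, response);
    }

    public void destroy() { }
}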

The HTTP specification has this to say about missing charsets (in
Content-Type headers):
"  The "charset" parameter is used with some media types to define the
   character set (section 3.4) of the data. When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP. Data in character sets other than "ISO-8859-1" or
   its subsets MUST be labeled with an appropriate charset value. See
   section 3.4.1 for compatibility problems."
(http://www.ietf.org/rfc/rfc2616.txt Section 3.7.1)

Basically, this says that a missing charset within a Content-Type header
means that the request should be interpreted as being encoded using
ISO-8859-1 encoding. Pretty simple.

> Shouldn't you trust the user agent rather
> than trying to guess? (Or is this only used in cases where the user agent is
> "broken", like you said - but then how would you know you're dealing with a
> broken client to begin with...aah, complicated!)

You should /always/ respect the charset sent by the client. In fact, the
HTTP spec says so:
"HTTP/1.1 recipients MUST respect the charset label provided by the sender;"
(http://www.ietf.org/rfc/rfc2616.txt Section 3.4.1)

If the client sends the wrong charset, it's their fault that their data
will get all screwed up.

But, if there's no charset, then you should provide your own. The
default charset should be ISO-8859-1. I think Tomcat uses the default
encoding of the JVM if no charset is provided, which is a problem for
folks who set the JVM encoding to UTF-8 for i18n purposes... because
then the default becomes UTF-8, which is incorrect. Fortunately, UTF-8
and ISO-8859-1 encode the lower ASCII range identically. This has led
to a lot of folks thinking that they have their servers configured
correctly because it "looks like it works", but it will fail for things
such as accented Latin characters, etc.
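
If you want to see that failure mode in isolation, here's a tiny
standalone demonstration (made-up class name):

public class CharsetMismatch {
    public static void main(String[] args) throws Exception {
        byte[] utf8 = "café".getBytes("UTF-8"); // 'é' becomes two bytes
        // Lower ASCII decodes fine either way, but the accent is
        // mangled when the wrong charset is assumed:
        System.out.println(new String(utf8, "ISO-8859-1")); // "cafÃ©"
    }
}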

> What do you mean by this? Does it mean (pardon the surely messed up use of
> the API below) in your response.addCookie(), you add a cookie where the
> value has cookie.setValue(new String(charByteArray,"UTF-8")) then you read
> it back using responseCookie.getValue().getBytes("UTF-8")? (Where UTF-8 is
> whatever encoding you're using internally in your application.)

Unless you are working with binary data, you shouldn't be using byte
arrays: you should be using Strings. If you are putting binary data
into a cookie, you should probably be encoding it using a reasonable
binary-to-text scheme, such as base64 or plain hex encoding
(0102030405060708090a0b0c0d0e... that kind of thing). HTTP headers are
always text, and yours should not be in binary. If you check out how
HTTP Basic authentication works, you'll see that the Authorization
header uses base64 to encode the binary data that is sent over the
wire. Then, you don't have to worry about what charset you're using.
The response object already knows what encoding to use and when.
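
For illustration, something like this sketch would do it (the class
name is mine; java.util.Base64 is the Java 8 class, and older setups
would typically use commons-codec instead). The URL-safe alphabet
without padding avoids '+', '/' and '=', which can be troublesome in
cookie values:

import java.util.Base64;
import javax.servlet.http.Cookie;

public class CookieCodec {
    // Wrap arbitrary bytes in a cookie-safe base64 string.
    static Cookie encode(String name, byte[] binary) {
        String value = Base64.getUrlEncoder().withoutPadding()
                             .encodeToString(binary);
        return new Cookie(name, value);
    }

    // Recover the original bytes when the cookie comes back.
    static byte[] decode(Cookie cookie) {
        return Base64.getUrlDecoder().decode(cookie.getValue());
    }
}

Then it's just response.addCookie(CookieCodec.encode(...)) on the way
out, and CookieCodec.decode(...) on the way back in.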

Don't forget that the HTTP headers are not part of the request or
response body. They are defined to use ISO-8859-1 (in practice, plain
ASCII). So, if you're using some odd charset like UTF-16, the headers
are still expressed in good old single-byte characters, even though the
body will be using two-byte characters.

> Finally, what's the default encoding used by the response when
> response.setCharacterEncoding(myEncoding) isn't called?

That depends. The server will pick an encoding that makes sense. I would
imagine that if the client sent an Accept-Charset header that was
compatible with the default encoding of the JVM, then that charset will
be used. Other than that, I have no idea.

> Am I correct to
> assume that if that default is not the default Java String encoding of
> UTF-16, then you MUST call convert all the Strings you've outputted to that
> encoding? (...because the HTTP header expects whatever the default is, but
> Java is outputting UTF-16 encoded text to the actual response bytes)

Just to note, Java uses UTF-16 internally to store char values. That
doesn't mean that it's the "default encoding" for Java. The default
encoding for the JVM is, in fact, settable by the user. You can read
that value from the system property "file.encoding".
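
Both are easy to check; a throwaway class like this (name is mine)
prints them:

import java.nio.charset.Charset;

public class DefaultEncoding {
    public static void main(String[] args) {
        // The system property and the Charset view of the same thing.
        System.out.println(System.getProperty("file.encoding"));
        System.out.println(Charset.defaultCharset());
    }
}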

Tomcat (properly) uses java.io.Writer objects when writing character
data to HTTP responses. Look at the javadoc for
ServletResponse.getWriter():

http://tomcat.apache.org/tomcat-5.5-doc/servletapi/javax/servlet/ServletResponse.html#getWriter()

"Returns a PrintWriter object that can send character text to the
client. The PrintWriter uses the character encoding returned by
getCharacterEncoding(). If the response's character encoding has not
been specified as described in getCharacterEncoding  (i.e., the method
just returns the default value ISO-8859-1), getWriter  updates it to
ISO-8859-1."

So, the servlet specification sets the default character set to
ISO-8859-1, which is inconvenient for users of non-Latin character
sets. That means that, if you want to use something else, you should
set the character encoding /before/ any call to getWriter occurs. I
recommend UTF-8: it covers all Unicode characters, yet uses fewer bytes
than UTF-16 when you are sending regular Latin characters, which is
nice.
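
In sketch form (a hypothetical, unmapped servlet):

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.*;

public class Utf8PageServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request,
                         HttpServletResponse response) throws IOException {
        // Set the charset BEFORE getWriter(); afterwards the encoding
        // is frozen at ISO-8859-1 and this would be silently ignored.
        response.setContentType("text/html; charset=UTF-8");
        PrintWriter out = response.getWriter();
        out.println("<p>naïve, señor, €</p>");
    }
}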

> P.S. How did you learn all of that?!

Experience. Most of the references I just looked up on the spot, because
I know where to find them. I don't have all those quotes in my brain ;)

-chris

