-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

lightbulb,
lightbulb432 wrote:
>> POST requests always use the request's "body" encoding, which is
>> specified in the HTTP header (and can be overridden by using
>> request.setCharacterEncoding). Some broken clients don't provide
>> the character encoding of the request, which makes things difficult
>> sometimes.
>
> What determines what's specified in the HTTP header for the value of the
> encoding?

Well... it's a bit of a chicken-and-egg scenario, since the encoding
specified in the header must match the encoding actually used in the
request. So, you could either decide that the header should match the
content, or that the content should match the header.

> Is it purely up to the user agent, or can Tomcat provide hints
> based on previous requests how to encode it - or is it something up to the
> end user to set in their browser (in IE, View -> Encoding)?

Typically, the default encoding used by the user-agent will be
locale-specific. For instance, most browsers in the US will use
ISO-8859-1 as the default encoding, or maybe WINDOWS-1252 if you're
unlucky. Ideally, the server should be able to accept all reasonable
encodings.

The "Accept-Charset" header sent by the user-agent to the server
indicates the encodings that are acceptable in the response, rated by
preference. For instance, my en_US Mozilla Firefox on Windows sends this
Accept-Charset string to servers:

Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7

This indicates that the browser would prefer ISO-8859-1 encoding, will
also accept UTF-8 as a second choice, and that anything will do ('*') if
those two are unavailable.

On HTML <form> elements, you may override the encoding used to send the
data:

<form accept-charset="UTF-8">

The HTML 4 specification says this about the accept-charset attribute:

"The default value for this attribute is the reserved string "UNKNOWN".
User agents may interpret this value as the character encoding that was
used to transmit the document containing this FORM element."
(http://www.w3.org/TR/html4/interact/forms.html#h-17.3)

So, if the server sends a document using UTF-8, it is "polite" for the
user-agent to use that same encoding to respond to the server if the
server hasn't indicated any preference by using the accept-charset
<form> attribute.

> In what cases would you call request.setCharacterEncoding to override the
> value specified by the user agent?

You should only do this when the user-agent does not declare the charset
being used in the body of the request through the Content-Type request
header. You should also only do this when you are relatively confident
that the user-agent is sending the data in the overridden character set.

For instance, if you suspect that most browsers adhere to the W3C's
recommendation above that an UNKNOWN accept-charset implies that the
browser should respond to the server with the same charset as was used
in the previous server response (got all that?), and you always use the
same charset to send pages (say, UTF-8), then it is reasonable to
override any unspecified Content-Type encoding with the charset you use
to send pages (UTF-8, in this case).

The HTTP specification has this to say about missing charsets (in
Content-Type headers):

"The "charset" parameter is used with some media types to define the
character set (section 3.4) of the data. When no explicit charset
parameter is provided by the sender, media subtypes of the "text" type
are defined to have a default charset value of "ISO-8859-1" when
received via HTTP. Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value. See
section 3.4.1 for compatibility problems."
(http://www.ietf.org/rfc/rfc2616.txt Section 3.7.1)

Basically, this says that a missing charset within a Content-Type header
means that the request should be interpreted as being encoded using
ISO-8859-1. Pretty simple.

> Shouldn't you trust the user agent rather
> than trying to guess?
> (Or is this only used in cases where the user agent is
> "broken", like you said - but then how would you know you're dealing with a
> broken client to begin with...aah, complicated!)

You should /always/ respect the charset sent by the client. In fact, the
HTTP spec says so:

"HTTP/1.1 recipients MUST respect the charset label provided by the
sender;"
(http://www.ietf.org/rfc/rfc2616.txt Section 3.4.1)

If the client sends the wrong charset, it's their fault that their data
will get all screwed up. But if there's no charset, then you should
provide your own. The default charset should be ISO-8859-1.

I think Tomcat uses the default encoding of the JVM if no charset is
provided, which is a problem for folks who set the JVM encoding to UTF-8
for i18n purposes... because then the default becomes UTF-8, which is
incorrect. Fortunately, UTF-8 and ISO-8859-1 encode the ASCII range
(0-127) identically. This has led to a lot of folks thinking that they
have their servers configured correctly because it "looks like it
works", but it will fail for things such as accented Latin characters,
etc.

> What do you mean by this? Does it mean (pardon the surely messed up use of
> the API below) in your response.addCookie(), you add a cookie where the
> value has cookie.setValue(new String(charByteArray,"UTF-8")) then you read
> it back using responseCookie.getValue().getBytes("UTF-8")? (Where UTF-8 is
> whatever encoding you're using internally in your application.)

Unless you are working with binary data, you shouldn't be using byte
arrays: you should be using Strings. If you are putting binary data into
a cookie, you should probably be encoding it using a reasonable
binary-encoding scheme, such as base64, or plain hex encoding
(0102030405060708090a0b0c0d0e... that kind of thing). HTML is always
text, and your headers should not be in binary.
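To make that concrete, here is a minimal sketch of both encodings using
java.util.Base64 from the standard library (Java 8+). The sample bytes
are purely illustrative, not from any real application:

```java
import java.util.Arrays;
import java.util.Base64;

public class CookieValueDemo {
    public static void main(String[] args) {
        // Hypothetical binary data you might want to store in a cookie
        byte[] binary = {(byte) 0xC3, (byte) 0xA9, 0x01, 0x02};

        // base64: a compact, text-safe representation of binary data
        String b64 = Base64.getEncoder().encodeToString(binary);

        // hex encoding: twice as many characters, but human-readable
        StringBuilder hex = new StringBuilder();
        for (byte b : binary) {
            hex.append(String.format("%02x", b));
        }

        System.out.println(b64);  // safe to put in a Set-Cookie header
        System.out.println(hex);  // prints "c3a90102"

        // Reading the cookie back: decoding recovers the original bytes
        byte[] decoded = Base64.getDecoder().decode(b64);
        System.out.println(Arrays.equals(decoded, binary));  // prints "true"
    }
}
```

Either way, what travels in the header is plain ASCII text, so no
charset question ever arises for the cookie value itself.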
If you check out how the WWW-Authenticate header works, you'll see that
base64 is used to encode the binary data that is sent over the wire.

Then, you don't have to worry about what charset you're using. The
response object already knows what encoding to use and when.

Don't forget that the HTTP headers are not part of the request or
response body. They are defined to be in ASCII, as far as I can tell.
So, if you're using some odd charset like UTF-16, the headers are still
expressed in good-old single-byte characters, even though the body will
be using two-byte characters.

> Finally, what's the default encoding used by the response when
> response.setCharacterEncoding(myEncoding) isn't called?

That depends. The server will pick an encoding that makes sense. I would
imagine that if the client sent an Accept-Charset header that was
compatible with the default encoding of the JVM, then that charset will
be used. Other than that, I have no idea.

> Am I correct to
> assume that if that default is not the default Java String encoding of
> UTF-16, then you MUST call convert all the Strings you've outputted to that
> encoding? (...because the HTTP header expects whatever the default is, but
> Java is outputting UTF-16 encoded text to the actual response bytes)

Just to note, Java uses UTF-16 internally to store char values. That
doesn't mean that it's the "default encoding" for Java. The default
encoding for the JVM is, in fact, settable by the user. You can read
that value from the system property "file.encoding".

Tomcat (properly) uses java.io.Writer objects when writing character
data to HTTP responses, so the conversion from Java's internal UTF-16
chars to the response's byte encoding happens for you. Look at the
javadoc for ServletResponse.getWriter():

http://tomcat.apache.org/tomcat-5.5-doc/servletapi/javax/servlet/ServletResponse.html#getWriter()

"Returns a PrintWriter object that can send character text to the
client. The PrintWriter uses the character encoding returned by
getCharacterEncoding().
If the response's character encoding has not been specified as described
in getCharacterEncoding (i.e., the method just returns the default value
ISO-8859-1), getWriter updates it to ISO-8859-1."

So, the servlet specification sets the default character set to
ISO-8859-1, which is inconvenient for users of non-Latin character sets.
That means that, if you want to use something else, you must set the
character encoding /before/ any call to getWriter occurs. I recommend
UTF-8: it covers all Unicode characters, but also uses fewer bytes than
UTF-16 when you are sending regular Latin characters, which is nice.

> P.S. How did you learn all of that?!

Experience. Most of the references I just looked up on the spot, because
I know where to find them. I don't have all those quotes in my brain ;)

- -chris
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGkl9h9CaO5/Lv0PARAmgWAJ9nq0dDw8HUksc5TCDh5odprw858wCgq9OY
FxtYQxqzuqjwm/OsKm2mvAM=
=1zKK
-----END PGP SIGNATURE-----

---------------------------------------------------------------------
To start a new topic, e-mail: users@tomcat.apache.org
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]