[OT] Re: Character encoding

lightbulb432 Sun, 08 Jul 2007 20:57:34 -0700

That was a really great set of answers, thanks! These follow-ups are somewhat
off-topic to Tomcat, but you really know this stuff well so I hope you don't
mind addressing them:



> POST requests always use the request's "body" encoding, which is specified
> in the HTTP header (and can be overridden by using
> request.setCharacterEncoding). Some broken clients don't provide the
> character encoding of the request, which makes things difficult sometimes.

What determines what's specified in the HTTP header for the value of the
encoding? Is it purely up to the user agent, or can Tomcat provide hints
based on previous requests how to encode it - or is it something up to the
end user to set in their browser (in IE, View -> Encoding)?

In what cases would you call request.setCharacterEncoding to override the
value specified by the user agent? Shouldn't you trust the user agent rather
than trying to guess? (Or is this only used in cases where the user agent is
"broken", like you said - but then how would you know you're dealing with a
broken client to begin with...aah, complicated!)
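
To make the "broken client" case concrete: as far as I can tell from the
servlet API docs, request.getCharacterEncoding() returns null when the client
declared no charset, and that is the situation where an override before
reading any parameters makes sense. The sketch below uses only the JDK, not
the servlet API. The class and its resolve/decode helpers are hypothetical
names made up for illustration, and the fallback charset is an assumption the
application has to choose for itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration, not Tomcat code: decide which charset to use
// for a request body when the client may not have declared one.
final class BodyDecodingSketch {

    // Trust the declared charset when present; otherwise use the fallback.
    static Charset resolve(String declared, Charset fallback) {
        if (declared == null || declared.trim().isEmpty()) {
            return fallback;
        }
        return Charset.forName(declared.trim());
    }

    // Decode raw body bytes with the resolved charset.
    static String decode(byte[] body, String declared, Charset fallback) {
        return new String(body, resolve(declared, fallback));
    }

    public static void main(String[] args) {
        // What a browser would send for "name=café" from a UTF-8 page.
        byte[] body = "name=caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // No charset declared, wrong fallback: the two UTF-8 bytes of the
        // accented letter are read as two separate ISO-8859-1 characters.
        System.out.println(decode(body, null, StandardCharsets.ISO_8859_1));

        // No charset declared, fallback matches what the pages are served in.
        System.out.println(decode(body, null, StandardCharsets.UTF_8));

        // Charset declared: the fallback is ignored.
        System.out.println(decode(body, "UTF-8", StandardCharsets.ISO_8859_1));
    }
}
```

In a servlet the same decision would be made once, before the first call to
getParameter() or getReader(), typically in a filter.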



> You shouldn't have to worry about cookie encoding, since you can always
> call request.getCookies() and get them "correctly" interpreted for you.

What do you mean by this? Does it mean (pardon the surely messed up use of
the API below) in your response.addCookie(), you add a cookie where the
value has cookie.setValue(new String(charByteArray,"UTF-8")) then you read
it back using responseCookie.getValue().getBytes("UTF-8")? (Where UTF-8 is
whatever encoding you're using internally in your application.)
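
For comparison, here is a sketch of one common way to keep non-ASCII text out
of the cookie header entirely: percent-encode the value to plain ASCII before
setting it and decode it after reading it back. This is an assumption about
how an application might handle it, not something stated in the thread. The
class and method names are hypothetical; URLEncoder and URLDecoder are
standard JDK classes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper: keep cookie values ASCII-only on the wire.
final class CookieValueSketch {

    // Value to pass to cookie.setValue(...): ASCII letters, digits,
    // a few punctuation characters, '+' for spaces and %XX escapes.
    static String toCookieValue(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Inverse step, applied to cookie.getValue() on the next request.
    static String fromCookieValue(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00e9 au lait";
        String onTheWire = toCookieValue(original);
        System.out.println(onTheWire);                  // caf%C3%A9+au+lait
        System.out.println(fromCookieValue(onTheWire).equals(original));
    }
}
```

Note that new String(bytes, "UTF-8") followed by getBytes("UTF-8") only
converts between bytes and a Java String inside the application; it does not
change what characters end up in the header.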


Finally, what's the default encoding used by the response when
response.setCharacterEncoding(myEncoding) isn't called? Am I correct to
assume that if that default is not the default Java String encoding of
UTF-16, then you MUST convert all the Strings you've output to that
encoding? (...because the HTTP header expects whatever the default is, but
Java is outputting UTF-16 encoded text to the actual response bytes)
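
A small experiment suggests the manual conversion is not needed: a Java
String has no encoding of its own once it reaches a Writer, and the Writer
produces bytes in whatever charset it was created with. As far as I can tell
from the Servlet specification, the response defaults to ISO-8859-1 when no
encoding is set. The sketch below uses a plain OutputStreamWriter as a
stand-in for response.getWriter(); that substitution and the class name are
assumptions for illustration only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration: the Writer, not the String, decides the bytes.
final class WriterEncodingSketch {

    // Write text through a Writer bound to the given charset and return
    // the bytes that would go out on the wire.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u00e9"; // a single accented letter

        // Same String, different Writers, different bytes.
        System.out.println(write(text, StandardCharsets.ISO_8859_1).length); // 1
        System.out.println(write(text, StandardCharsets.UTF_8).length);      // 2
    }
}
```

The same String comes out as one byte through the ISO-8859-1 Writer and two
bytes through the UTF-8 Writer, with no conversion of the String itself.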

Am I speaking rubbish here, or am I thinking about these concepts in the
right way?

Thanks a lot.

P.S. How did you learn all of that?!




Christopher Schultz-2 wrote:
> 
> Lightbulb,
> 
> lightbulb432 wrote:
>> Why is the URIEncoding attribute specified on the connector rather than
>> on a host, for example?
> 
> Because the host doesn't handle connections... the connectors do.
> 
>> Does this mean that the number of virtual hosts that can
>> listen on the same port on the same box is limited by whether they all
>> use the same encodings in their URIs?
> 
> Yes, all virtual hosts listening on the same port will have to have the
> same encoding. Fortunately, UTF-8 works for all languages that I know of.
> 
>> Now that I think about it, wouldn't it be
>> at the context level, not even at the host level?
> 
> If you had a connector-per-context, yes, but that's not the case.
> 
>> In Tomcat 6, should the useBodyEncodingForURI be used if not needing
>> compatibility with 4.1, as the documentation mentions? 
> 
> I would highly recommend following that recommendation.
> 
>> To see if I have things straight, is HttpServletRequest's
>> get/setCharacterEncoding used for both the request parameters from a GET
>> request AND the contents of the POST?
> 
> No. GET requests have request parameters encoded as part of the URL,
> which is affected by the <Connector>'s URIEncoding parameter. POST
> requests always use the request's "body" encoding, which is specified in
> the HTTP header (and can be overridden by using
> request.setCharacterEncoding). Some broken clients don't provide the
> character encoding of the request, which makes things difficult sometimes.
> 
>> How are multipart POST requests dealt with?
> 
> Typically, each part of a multipart request contains its own character
> encoding, so a multipart POST would follow the encoding for the part
> you're reading at the time.
> 
>> And HttpServletResponse's get/setCharacterEncoding is used for the
>> contents of the response header and the meta tags?
> 
> Only for the header field, not META tags. If you want to emit META tags,
> you'll have to do them yourself.
> 
>> Does it also encode the page content itself? 
> 
> Nope. If you change the character encoding for a response after the
> response has already had some data written to it, I think you'll send an
> incorrect header. For instance:
> 
> response.setCharacterEncoding("ISO-8859-1");
> PrintWriter out = response.getWriter();
> 
> response.setCharacterEncoding("Big5");
> 
> out.print("abcdef");
> out.flush();
> 
> Your client will not receive a sane response. Setting the character
> encoding only sets the HTTP response header and configures the
> response's Writer, if used, but only /before/ calling getWriter the
> first time.
> 
>> What about the encoding of cookies for both incoming requests and
>> outgoing responses?
> 
> See the HTTP spec, section 4.2 ("Message Headers"). It references RFC
> 822 (ARPA Internet text messages) which does not actually specify a
> character encoding. From what I can see, low ASCII is the encoding used.
> You shouldn't have to worry about cookie encoding, since you can always
> call request.getCookies() and get them "correctly" interpreted for you.
> 
> -chris

-- 
View this message in context: 
http://www.nabble.com/Character-encoding-tf4031134.html#a11495606
Sent from the Tomcat - User mailing list archive at Nabble.com.

