On Sat, 12 May 2001, Alec Yu wrote:
> I read some code in catalina & jasper, and found that: There is a
> setCharacterEncoding() for servlet request now; but I greped all
> Tomcat code, and found nowhere called it. It means, by default, Tomcat
> use a default encoding of '8859_1'. There is no option in
> server.xml/web.xml for tomcat to set a default encoding for a
> context/container(or whatever) to use a default encoding other than
> '8859_1'.
>
Servlet Specification 2.3 (Proposed Final Draft 2), Section 5.4 (p. 44):
'The default encoding of a response is "ISO-8859-1"
if none has been specified by the servlet programmer.'
Providing container-level overrides for this would seem to break the spec,
and any application that depended on that feature would not be portable
to other containers.
> Also, the alternative (JSP compiling) encoding option in conf/web.xml
> for jasper seems failed to work (at least, failed for JSP pages in
> big5 encoding). When there is no '<% page contentType="text/html;
> charset=xxx" %>' in a JSP, jasper use '8859_1' as its the JSP's
> default encoding, oops.
>
Again, this is a spec requirement. This time it's JSP 1.2 (Proposed Final
Draft 2), Section 2.10.1 (p. 52):
'The CHARSET value of contentType is used as default if
present, or ISO-8859-1 otherwise.'
> We are working on a product deploying JSP pages which targeting
> multiple markets in Japan, Taiwan, and probably China mainland. Sure,
> when we maintain our JSP pages (initially show messages in english,
> but should be able to handle input in localized character encodings),
> we don't like to maintain 3 versions of JSP pages with each version of
> them differed only in the page directive: '<% page
> contentType="text/html; charset=xxx" %>'
>
In JSP 1.2, there is one new feature that can help in this situation. You
can set the content type dynamically in a scriptlet or custom tag, as long
as the response has not yet been committed. See the overall page
lifecycle discussion in Section 2.7.
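
As a rough sketch of how one page could serve all three markets, the
charset could be picked from the client's locale and passed to
setContentType(). The ContentTypeChooser class, its method names, and the
locale-to-charset mapping below are my own assumptions for illustration,
not anything defined by the spec or by Tomcat:

```java
import java.util.Locale;

// Hypothetical helper: maps a locale to the content type a page should send.
class ContentTypeChooser {

    // Assumed mapping for the markets mentioned above; adjust to your needs.
    static String charsetFor(Locale locale) {
        String lang = locale.getLanguage();
        if ("ja".equals(lang)) {
            return "Shift_JIS";
        }
        if ("zh".equals(lang)) {
            // Taiwan gets Big5; any other Chinese locale falls back to GB2312.
            return "TW".equals(locale.getCountry()) ? "Big5" : "GB2312";
        }
        // Same default the servlet spec uses when nothing is specified.
        return "ISO-8859-1";
    }

    static String contentTypeFor(Locale locale) {
        return "text/html; charset=" + charsetFor(locale);
    }

    public static void main(String[] args) {
        System.out.println(contentTypeFor(Locale.TAIWAN));
        System.out.println(contentTypeFor(Locale.JAPAN));
    }
}
```

In a JSP 1.2 page, a scriptlet near the top could then call
response.setContentType(ContentTypeChooser.contentTypeFor(request.getLocale()))
before any output has been committed.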
>
> And, I found Tomcat does byte->char typecast first and then char->byte
> typecast back before converting bytes into a java string.
> Unfortunately, because the character encoding is never changed from
> '8859_1' to some other customized one assigned in somewhere other than
> in code.
>
Are you talking about the output character encoding sent to the browser?
You can set that (along with the content type) by calling
response.setContentType("text/html; charset=xxxxx");
as long as this is done before the first buffer-full is flushed.
> This seems to work at first, as long as you don't treat strings read
> from GET/POST parameters as Unicode strings, because they are NOT
> VALID UNICODE STRINGS. Web output generated from servlets/JSP pages
> may be right, simply because contents in these NOT VALID UNICODE
> STRINGS are converted into bytes again by simply doing char->byte
> typecasting.
>
For GET requests, there are not very many good solutions because the
request itself does not include information about the character encoding
that was used on the request URI.
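
To illustrate why this is ambiguous, here is a small standalone sketch
(plain JDK, no servlet API) that decodes the same %xx sequence under two
different encodings. The GetParamDecoding class and its method names are
hypothetical. The redecode() method shows a commonly used workaround, not
something the spec prescribes: if the container decoded the bytes as
ISO-8859-1 and the app knows the real encoding, the original bytes can be
recovered and decoded again.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper showing that %xx bytes carry no encoding information.
class GetParamDecoding {

    // Decode a URL-encoded value, assuming the given character encoding.
    static String decode(String raw, String encoding) {
        try {
            return URLDecoder.decode(raw, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    // Recover a string that was decoded as ISO-8859-1 but was really sent
    // in another encoding. ISO-8859-1 maps every byte to one char, so the
    // original bytes survive the round trip.
    static String redecode(String decodedAsLatin1, String realEncoding) {
        try {
            byte[] original = decodedAsLatin1.getBytes("ISO-8859-1");
            return new String(original, realEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + realEncoding, e);
        }
    }

    public static void main(String[] args) {
        String raw = "%E4%B8%AD"; // one CJK character, if the sender used UTF-8
        System.out.println(decode(raw, "UTF-8").length());
        System.out.println(decode(raw, "ISO-8859-1").length());
        System.out.println(redecode(decode(raw, "ISO-8859-1"), "UTF-8").length());
    }
}
```

Both decodings succeed without error, which is exactly the problem: nothing
in the request says which one the browser intended.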
For POST requests, the request parameters will be parsed in the character
encoding specified by the browser (as part of the content type
header). If the browser does not specify one, a new feature in Servlet 2.3
lets you call request.setCharacterEncoding() before reading any request
parameters, provided the app knows what character encoding was used.
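
The decision described above (use the charset from the content type header
if there is one, otherwise fall back to what the app knows) can be sketched
in plain Java. The RequestEncoding class and its methods are hypothetical
names of mine, and the header parsing is deliberately simplified:

```java
// Hypothetical helper: decide which encoding to use for request parameters.
class RequestEncoding {

    // Extract the charset parameter from a Content-Type header value,
    // or return null if the header is absent or names no charset.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String value = p.substring(8).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Prefer what the browser declared; otherwise use the app's default.
    static String effectiveEncoding(String contentType, String appDefault) {
        String declared = charsetOf(contentType);
        return declared != null ? declared : appDefault;
    }

    public static void main(String[] args) {
        System.out.println(effectiveEncoding(
                "application/x-www-form-urlencoded; charset=Big5", "UTF-8"));
        System.out.println(effectiveEncoding(
                "application/x-www-form-urlencoded", "Shift_JIS"));
    }
}
```

In a Servlet 2.3 container the same check is simpler, because
request.getCharacterEncoding() returns null when the browser sent no
charset; in that case the app would call
request.setCharacterEncoding(appDefault) before the first getParameter()
call.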
> Oops! It goes too far. People can't just do
> internalization/localization in such a "garbage in garbage out"
> solution. Maybe it looks right both in the input/output ends, if you
> simply GET/POST something and out.println(xxx.getParameter("foo")).
> But if you are doing something serious with character encodings other
> than 8859_1 (if Big5, GB2312 and Shift_JIS are for localization and
> not serious enough, how about utf-8 character encoding? indeed, Tomcat
> garbaged GET/POST inputs in utf-8 encoding), you must handle this
> problem.
>
> Personally, I code my own connector to aim this problem. The connector
> works as a bridge from Sun's Brazil web server (a light-weight web
> server in 100% java), Brazil HTTP request objects are passed directly
> into the connector (rather than via some socket protocl), such that
> the connector does configure servlets/JSP pages to use a default
> encoding given by properties set in the Brazil configuration file, and
> it does URL encoding check against raw strings input in GET/POST
> parameters in localized character encoding, as to make sure Tomcat
> does right character conversions for these parameters. (the %xx URL
> decoding code in parseParameters() in Tomcat 4 beta 3/4 works fine,
> but the byte->char/char->byte code drops some characters) But there is
> no way to modify jasper's default compiling encoding, except modify
> its code.
>
Could you point me specifically to the byte->char/char->byte code that you
are concerned about?
You are obviously free to do this kind of special connector, and/or modify
Tomcat to meet your needs -- but you're also making yourself dependent on
conventions that are contrary to the servlet and JSP specifications. Any
apps you write that depend on this behavior won't run on any other servers
that implement the standards. You might want to look at standards-based
alternatives to at least some of the issues that you have raised.
Craig McClanahan