[RT] About charsets (character encoding) and servlet API

Antonio Gallardo Sat, 29 May 2004 08:12:05 -0700

Hi:

I am not sure whether this belongs in Carsten's RT about the new Cocoon
version. I should say I am far from being an authority in this area, but I
think it is time we discussed it:


Introduction
============

It is a fact: the world is moving to UTF-8. Many new development
requirements mention i18n and support for foreign languages, and Cocoon
cannot stay out of this. The current approach to i18n inside Cocoon is
going nowhere. People still have problems when they try to use UTF-8 in
their applications, and it is becoming a serious gap in functionality. A
simple search of the mail archives for the keyword UTF-8 finds 2268 mails!
That number will grow faster than ever if we don't solve this problem.

Note, I am aware of the i18n samples in Cocoon, and the current efforts to
solve this are being documented at:

http://wiki.cocoondev.org/Wiki.jsp?page=RequestParameterEncoding

But there is still a question in my mind:

Why do we need to make rocket science out of this triviality?

I am also asking myself whether the problem is a bug in Tomcat or lies in
the servlet API we are currently using.

The following lines explain why.

Revisiting the servlet API 2.2
==============================

Final Release: December 17th, 1999.

Cocoon has been using the servlet specification 2.2 for several years (I
am not sure whether since its first days). The 2.2 servlet specification
doesn't have a clear policy on how to handle the i18n problem. Instead it
leaves the problem to the servlet developers (in this case, the Cocoon
developers). And this is why I am not sure we can blame Tomcat or other
servlet containers. I guess this is mainly why Tomcat does nothing about
it: we use Servlet API 2.2.

Facts:

1- In the Servlet API 2.2, the methods that parse parameter input
(getParameter() etc.) ALWAYS assume that it was sent as ISO-8859-1.

2- ISO-8859-1 is the default encoding of HTTP!

                              - 0 -

So if we send characters in, say, UTF-8, the container creates a String
that holds the correct byte values but decoded with the wrong charset!

And this is why Cocoon needs a hack (or a "fix") to convert the bytes to a
String using the correct charset. Bruno showed me that we use something
like:

new String(value.getBytes("8859_1"), "utf-8")

Knowing the above facts, let me describe what happens when the browser
sends parameters in UTF-8:

The browser converts each character to bytes using the encoding of the
page (in this example, UTF-8) and sends each byte value as a hexadecimal
string (%XX).

Then the server (the servlet container) interprets these values and always
assumes they are 8859-1 byte values! So it creates a Unicode string based
on the byte values interpreted as 8859-1. Since the 8859-1 assumption is
made by the container, the Cocoon hack (or "fix") is needed independently
of the platform we run on.
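
The sequence above can be reproduced with the JDK alone. The following is
only a minimal sketch, not Cocoon's or Tomcat's actual code: the class and
method names (CharsetRoundTrip, browserEncode, containerDecode, repair) are
made up for illustration, and URLEncoder/URLDecoder merely stand in for the
browser and the container.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper class: it only imitates the browser and the container.
class CharsetRoundTrip {

    // Step 1: the "browser" converts the text to UTF-8 bytes and sends them as %XX.
    static String browserEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Step 2: the "container" decodes the %XX bytes, assuming ISO-8859-1.
    static String containerDecode(String raw) {
        try {
            return URLDecoder.decode(raw, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // ISO-8859-1 is always available
        }
    }

    // Step 3: the hack: pull the original bytes back out and decode them as UTF-8.
    static String repair(String garbled) {
        try {
            return new String(garbled.getBytes("8859_1"), "utf-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String sent = "Jos\u00e9";              // four characters, the last one accented
        String raw = browserEncode(sent);       // the accented letter becomes two %XX bytes
        String garbled = containerDecode(raw);  // each byte becomes its own character
        System.out.println(raw);
        System.out.println(garbled.length());
        System.out.println(repair(garbled).equals(sent));
    }
}
```

The repair step works only because ISO-8859-1 maps every byte value to
exactly one character, so getBytes("8859_1") gives back the bytes the
browser sent without loss.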

But this is rocket science! Especially when we already have newer servlet
APIs that allow us to handle it in a more elegant way....


Moving to a new servlet specification?
======================================

Reading about this I found:

Since servlet specification 2.3 (Final Release August 13th, 2001), Sun has
started to solve some of the problems related to this topic. This API
provides support for handling foreign-language form submissions. In API
2.3 we can tell the server which character encoding to use for the request
with the method:

request.setCharacterEncoding(String encoding).

So to retrieve UTF-8 parameters we can simply use:

req.setCharacterEncoding("UTF-8");      // Set the charset to UTF-8
String name = req.getParameter("name"); // Read the parameter
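
To see why declaring the encoding up front removes the need for the hack,
here is a JDK-only simulation. It does not call the servlet API:
RequestEncodingSketch and its getParameter(body, name, encoding) method are
hypothetical stand-ins, and the parsing is deliberately naive (pairs
without '=' are skipped, repeated names are ignored).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical stand-in for a container that honours a declared request encoding.
class RequestEncodingSketch {

    // Looks up one parameter in a form body such as "name=Jos%C3%A9&city=Le%C3%B3n",
    // decoding the %XX bytes with the given encoding. Returns null if not found.
    static String getParameter(String body, String name, String encoding) {
        for (String pair : body.split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                continue; // naive: skip anything that is not key=value
            }
            try {
                String key = URLDecoder.decode(pair.substring(0, eq), encoding);
                if (key.equals(name)) {
                    return URLDecoder.decode(pair.substring(eq + 1), encoding);
                }
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String body = "name=Jos%C3%A9&city=Le%C3%B3n";
        // Decoded with the encoding the browser really used: no repair step needed.
        System.out.println(getParameter(body, "name", "UTF-8").length());
        // Decoded with the 2.2-style assumption: one extra, wrong character.
        System.out.println(getParameter(body, "name", "ISO-8859-1").length());
    }
}
```

With the real API, the Javadoc says setCharacterEncoding() must be called
before any parameter is read, and it describes the method as overriding
the encoding used in the body of the request.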

This is great, right? Let's see what we have now....

Servlet API 2.4
===============

Reading I found:

Introduced on November 24, 2003
Minimum J2SE required: 1.3

In this API, the ServletResponse interface (and the ServletResponseWrapper)
adds a new method that is interesting to us:

1- setCharacterEncoding(String encoding): Sets the response's character
encoding. This method provides an alternative to passing a charset
parameter to setContentType(String) or passing a Locale to
setLocale(Locale).

With this method, we can avoid setting the charset through a
setContentType("text/html; charset=UTF-8") call.

Servlet 2.4 also introduces a new <locale-encoding-mapping-list> element
in the web.xml deployment descriptor to let the deployer assign
locale-to-charset mappings outside servlet code. It looks like this:

<locale-encoding-mapping-list>
  <locale-encoding-mapping>
    <locale>ja</locale>
    <encoding>Shift_JIS</encoding>
  </locale-encoding-mapping>
  <locale-encoding-mapping>
    <locale>zh_TW</locale>
    <encoding>Big5</encoding>
  </locale-encoding-mapping>
</locale-encoding-mapping-list>

Now within this Web application, any response assigned to the ja locale
uses the Shift_JIS charset, and any assigned to the zh_TW Chinese/Taiwan
locale uses the Big5 charset. These values could later be changed to UTF-8
when it grows more popular among clients. Any locales not mentioned in the
list will use the container-specific defaults as before.
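
The lookup this mapping describes can be sketched in a few lines of Java.
This is an illustration only, not how any container implements it: the
class LocaleEncodingSketch and its encodingFor() method are hypothetical,
only exact locale names are matched (a simplifying assumption), and the
container default passed to the constructor is whatever the caller picks.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the locale-to-charset table from the web.xml fragment above.
class LocaleEncodingSketch {

    private final Map<String, String> mappings = new HashMap<String, String>();
    private final String containerDefault;

    LocaleEncodingSketch(String containerDefault) {
        this.containerDefault = containerDefault;
        // The two entries from the deployment descriptor example.
        mappings.put("ja", "Shift_JIS");
        mappings.put("zh_TW", "Big5");
    }

    // Returns the mapped charset, or the container default for unlisted locales.
    String encodingFor(String locale) {
        String encoding = mappings.get(locale);
        return encoding != null ? encoding : containerDefault;
    }

    public static void main(String[] args) {
        // "ISO-8859-1" as the default is an assumption made for this example.
        LocaleEncodingSketch table = new LocaleEncodingSketch("ISO-8859-1");
        System.out.println(table.encodingFor("ja"));
        System.out.println(table.encodingFor("zh_TW"));
        System.out.println(table.encodingFor("es"));
    }
}
```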

Conclusion
==========

I think most of us are using servlet containers that support servlet spec
2.3 or higher. So I think it is time to move to a newer servlet API spec.
I think these little things alone are reason enough.

Please tell me, WDYT?

Best Regards,

Antonio Gallardo

Further reading:

[1] "Servlet 2.3: New features exposed" -
http://www.javaworld.com/javaworld/jw-01-2001/jw-0126-servletapi.html

[2] "Servlet 2.4: What's in store" -
http://www.javaworld.com/javaworld/jw-03-2003/jw-0328-servlet.html
