Hi,

I am not sure whether this belongs in Carsten's RT about the new Cocoon version. I have to say I am far from being an authority in this area, but I think this time we need to discuss it:
Introduction
============

It is a fact: the world is moving to UTF-8. Many new development requirements mention i18n and support for foreign languages, and Cocoon cannot stay out of this.

The current approach to i18n inside Cocoon is going nowhere. People still have problems when they try to use UTF-8 in their applications, and this is becoming a serious gap in functionality. A simple search of the mail archives for the keyword UTF-8 finds 2268 mails! That number will grow faster than ever if we don't solve this problem.

Note: I am aware of the i18n samples in Cocoon, and the current efforts to solve the problem are being documented at:

http://wiki.cocoondev.org/Wiki.jsp?page=RequestParameterEncoding

But there is still a question in my mind: why do we need to make rocket science out of this triviality? I am also asking myself whether the problem is a bug in Tomcat or lies in the servlet API we are currently using. The following lines explain why.

Revisiting the servlet API 2.2
==============================

Final Release: December 17th, 1999.

Cocoon has been using servlet specification 2.2 for several years (I am not sure whether since its first days). The 2.2 servlet specification has no clear policy on how to handle the i18n problem. Instead, it leaves the problem to the servlet developers (in this case, the Cocoon developers). This is why I am not sure we can blame Tomcat or other servlet containers, and I guess it is mainly why Tomcat does nothing about it: we use Servlet API 2.2.

Facts:

1- In the Servlet API 2.2, the methods that parse parameter input (getParameter() etc.) ALWAYS assume that it is sent as ISO-8859-1.

2- ISO-8859-1 is the default encoding of HTTP!

- 0 -

So if we send characters in, say, UTF-8, the container creates a String that contains the correct byte values but was decoded with the incorrect charset! This is why Cocoon needs a hack (or a "fix") to convert the bytes to a String using the correct charset.
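The two facts above can be reproduced without a servlet container. What follows is only a minimal standalone sketch, not Cocoon code: the class and method names (CharsetRoundTrip, decodeAsContainer, redecodeAsUtf8) are made up for illustration, and it assumes Java 7 or later for java.nio.charset.StandardCharsets. It simulates a container that decodes UTF-8 bytes as ISO-8859-1 and shows that the resulting String is garbled but still carries the original byte values, which is why decoding it a second time can recover the text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {

    // Hypothetical stand-in for a Servlet 2.2 container:
    // raw parameter bytes are always decoded as ISO-8859-1.
    static String decodeAsContainer(byte[] rawBytes) {
        return new String(rawBytes, StandardCharsets.ISO_8859_1);
    }

    // The workaround: get the raw bytes back out of the mis-decoded
    // String, then decode them with the charset the browser really used.
    static String redecodeAsUtf8(String misdecoded) {
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";                       // "cafe" with e-acute
        byte[] sent = original.getBytes(StandardCharsets.UTF_8);

        String seenByServlet = decodeAsContainer(sent);

        // The String is wrong: one char per byte instead of one per character.
        System.out.println("garbled: " + !seenByServlet.equals(original));

        // But no byte was lost: ISO-8859-1 maps every byte to exactly one char.
        byte[] recovered = seenByServlet.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("bytes preserved: " + Arrays.equals(sent, recovered));

        // Decoding those bytes as UTF-8 gives the original text back.
        System.out.println("fixed: " + redecodeAsUtf8(seenByServlet).equals(original));
    }
}
```

The recovery only works because ISO-8859-1 assigns a character to every possible byte value, so the first (wrong) decoding loses nothing. If the container had assumed a charset without that property, the second decoding could not be relied on.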
Bruno showed me that we use something like:

    new String(value.getBytes("8859_1"), "utf-8")

Knowing the above facts, let me describe what happens when the browser sends parameters in UTF-8. The browser encodes the byte values of each character as hexadecimal strings, using the encoding of the page (in this example, UTF-8). Then the server (the servlet container) interprets these values and always assumes they are 8859-1 byte values! So it creates a Unicode string based on the byte values interpreted as 8859-1.

Since the 8859-1 assumption is made by the container, the Cocoon hack (or "fix") is needed regardless of the platform we run on. But it is rocket science! Especially when we already have newer servlet APIs that let us handle this in a more elegant way....

Moving to a new servlet specification?
======================================

Reading about this, I found:

Since servlet specification 2.3 (Final Release August 13th, 2001), Sun has started to solve some of the problems related to this topic. This API provides support for handling foreign-language form submissions. In API 2.3 we can tell the server which character encoding to use for the request with the method request.setCharacterEncoding(String encoding). So to retrieve UTF-8 parameters we can simply use:

    req.setCharacterEncoding("UTF-8");        // Set the charset to UTF-8 (before reading any parameter)
    String name = req.getParameter("name");   // Read the parameter

This is great, right? Let's see what we have now....

Servlet API 2.4
===============

Reading, I found:

Introduced on November 24, 2003
Minimum J2SE required: 1.3

In this API, the ServletResponse interface (and the ServletResponseWrapper) adds a new method that is interesting to us:

1- setCharacterEncoding(String encoding): Sets the response's character encoding. This method provides an alternative to passing a charset parameter to setContentType(String) or passing a Locale to setLocale(Locale).
With this method, we can avoid setting the charset through a setContentType("text/html; charset=UTF-8") call.

Servlet 2.4 also introduces a new <locale-encoding-mapping-list> element in the web.xml deployment descriptor, which lets the deployer assign locale-to-charset mappings outside the servlet code. It looks like this:

    <locale-encoding-mapping-list>
      <locale-encoding-mapping>
        <locale>ja</locale>
        <encoding>Shift_JIS</encoding>
      </locale-encoding-mapping>
      <locale-encoding-mapping>
        <locale>zh_TW</locale>
        <encoding>Big5</encoding>
      </locale-encoding-mapping>
    </locale-encoding-mapping-list>

Now, within this web application, any response assigned to the ja locale uses the Shift_JIS charset, and any response assigned to the zh_TW (Chinese/Taiwan) locale uses the Big5 charset. These values could later be changed to UTF-8 when it becomes more popular among clients. Any locale not mentioned in the list will use the container-specific defaults as before.

Conclusion
==========

I think most of us are using servlet containers that implement servlet spec 2.3 or later. Given that, I think it is time to move to a newer servlet API. I think these few small things are reason enough.

Please tell me, WDYT?

Best Regards,

Antonio Gallardo

Further reading:

[1] "Servlet 2.3: New features exposed" - http://www.javaworld.com/javaworld/jw-01-2001/jw-0126-servletapi.html
[2] "Servlet 2.4: What's in store" - http://www.javaworld.com/javaworld/jw-03-2003/jw-0328-servlet.html
