On 29 May 2004, at 16:11, Antonio Gallardo wrote:

I think most of us are using servlet containers with servlet specs 2.3 or
superior. In that way, I think it is time to move to a higher servlet API
specs? I think just this little things are enough.

I've been doing i18n work on Servlets for a _very_ long time and, dude, I've never seen a problem with the API ever...

Let's split the problem in three parts: headers, body and URLs:

--------
HEADERS:
--------

Now, the HTTP spec defines that a header needs to follow the RFC-822 section 3.1 specification, therefore (I'm going on memory here, not cross checking) the header name must be composed of only a strict subset of US-ASCII characters, and the header value can ONLY be made up of ISO-8859-1 characters.

No problemo here...

At around page 16 of the RFC-2616 Roy also mentions that IF you want to encode something in headers that IS NOT encodable in ISO-8859-1, you gotta follow RFC-2047 (Mime Part 3) which defines clearly how such values are encoded...
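To make the RFC-2047 bit concrete, here's a minimal sketch of what a container (or we) could do with a header value. The class and method names are made up for illustration, it only handles the UTF-8 "B" (base64) flavour, and it ignores the 75-character limit and line folding that RFC-2047 also asks for, so take it as a rough picture and not a full implementation:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of any servlet container or of Cocoon.
final class HeaderWordSketch {

    private static final String PREFIX = "=?UTF-8?B?";
    private static final String SUFFIX = "?=";

    // Leaves the value alone when ISO-8859-1 can represent it,
    // otherwise wraps the UTF-8 bytes in a single "B" encoded-word.
    static String encode(String value) {
        if (StandardCharsets.ISO_8859_1.newEncoder().canEncode(value)) {
            return value;
        }
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        return PREFIX + Base64.getEncoder().encodeToString(utf8) + SUFFIX;
    }

    // Reverses encode(); anything that is not a UTF-8 "B" encoded-word
    // is returned unchanged (assumption: no "Q" encoding, no other charsets).
    static String decode(String word) {
        boolean looksEncoded = word.length() >= PREFIX.length() + SUFFIX.length()
                && word.startsWith(PREFIX)
                && word.endsWith(SUFFIX);
        if (!looksEncoded) {
            return word;
        }
        String payload = word.substring(PREFIX.length(), word.length() - SUFFIX.length());
        return new String(Base64.getDecoder().decode(payload), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // three kanji, not encodable in ISO-8859-1
        String onTheWire = encode(japanese);
        System.out.println(onTheWire);
        System.out.println(decode(onTheWire).equals(japanese));
    }
}
```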

Now, when we do a setHeader in the response, or a getHeader on the request, the servlet container SHOULD encode/parse the values in the correct way, although I've never seen any of them doing it (they simply ignore the whole shebang, use ISO-8859-1 for both header names and values, and don't do any additional parsing/encoding).

Bug in the servlet containers...

-----
BODY:
-----

RFC-2616 is _very_ clear at this point, if you don't specify the charset token in the "Content-Type" header, and you specify (or imply) that the body is "text/something" you SHOULD assume that you're receiving / sending text encoded in ISO-8859-1...
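The rule above is simple enough to sketch. This is just an illustration with a made-up class name, not what any container actually does: pull the charset token out of a Content-Type value and fall back to ISO-8859-1 when it is missing. Note that I'm applying the fallback to every media type here, which is an assumption; the RFC only spells it out for "text/..." bodies:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class ContentTypeCharsetSketch {

    private static final String TOKEN = "charset=";

    // Returns the charset named in the Content-Type value, or ISO-8859-1
    // when there is no charset token (or the name is unknown to this JVM).
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            String[] parts = contentType.split(";");
            // parts[0] is the media type itself, parameters start at index 1
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim();
                if (!param.regionMatches(true, 0, TOKEN, 0, TOKEN.length())) {
                    continue;
                }
                String name = param.substring(TOKEN.length()).trim();
                // the value may be a quoted-string
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    if (Charset.isSupported(name)) {
                        return Charset.forName(name);
                    }
                } catch (IllegalArgumentException e) {
                    // illegal charset name: fall through to the default
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("text/plain"));
        System.out.println(charsetOf("application/x-www-form-urlencoded"));
    }
}
```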

Again, I seriously don't think that servlet containers check for the encoding of the request body when the content type is "application/x-www-form-urlencoded", because I _suppose_ that given that it doesn't start with "text/..." they ignore the whole shabang...

So, I believe that in some cases, the encoding of parameters returned by servlet containers MIGHT be wrong (but I ain't sure, haven't checked that lately).

When you send, on the other hand, the servlet API didn't offer much functionality until 2.4 to set the charset encoding of the response, but that _really_ affected only stupid JSPs which were never thought out right anyway...

In Cocoon (I hope) we should never rely on the "getWriter()" returned by the servlet container but ALWAYS use a "getOutputStream()" and set ALWAYS the content type with the proper "charset" token...

If we don't we're kinda violating 3.4.1 of RFC-2616 as it says that one SHOULD always put the charset in there (if relevant, of course).
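Roughly, the sending side would look like the sketch below. To keep it runnable without a servlet container I'm writing to a plain OutputStream: in a real servlet that stream would be the one from response.getOutputStream(), and the string built by contentType() would go to response.setContentType(). The class and method names are mine, just for illustration:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class ExplicitCharsetOutputSketch {

    // Builds a Content-Type value that always carries the charset token.
    static String contentType(String mimeType, Charset charset) {
        return mimeType + "; charset=" + charset.name();
    }

    // Encodes the body ourselves instead of trusting a container-supplied Writer.
    static void writeBody(OutputStream out, String body, Charset charset) throws IOException {
        Writer writer = new OutputStreamWriter(out, charset);
        writer.write(body);
        writer.flush(); // flush, but leave closing the stream to its owner
    }

    // Convenience for the demo: renders the body into a byte array.
    static byte[] render(String body, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeBody(buffer, body, charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String body = "caf\u00e9";
        System.out.println(contentType("text/html", StandardCharsets.UTF_8));
        System.out.println(render(body, StandardCharsets.UTF_8).length);      // e-acute takes two bytes
        System.out.println(render(body, StandardCharsets.ISO_8859_1).length); // e-acute takes one byte
    }
}
```

The point is simply that the bytes on the wire and the charset token are produced from the same Charset object, so they cannot disagree.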

So, the problem is only in reading parameters, and that should be fixed at the servlet container level.

----
URL:
----

URLs are important as sometimes the request parameters are passed as query string attached to them...

Initially they were defined on US-ASCII and/or ISO-8859-1 (can't remember which one exactly), and all non-printable characters had to be encoded with the usual percent-number-number format...

Great...

Between the W3C and RFC-2718 someone decided (at the end of the whole discussion) that URLs, in their internationalizable format only had to change in one aspect: the character encoding.

So, a URL nowadays (tested on my girlfriend's Japanese Internet Explorer) is a sequence of bytes representing a string encoded in UTF-8, and the same rule applies of encoding the characters outside of the originally-defined printable ones with the usual percent-number-number re-encoding...
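Here's a small sketch of what that means in practice, using java.net.URLEncoder and URLDecoder (which implement the form-style variant, so a space becomes "+"). The class name is made up. The same percent-encoded bytes give back the original text when decoded as UTF-8 and garbage when decoded as ISO-8859-1:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper for illustration only.
final class UrlUtf8Sketch {

    // Percent-encodes the UTF-8 bytes of the given text.
    static String encode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Percent-decodes and interprets the resulting bytes in the given charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // three kanji
        String encoded = encode(japanese);
        System.out.println(encoded);
        // Right charset: we get the three characters back.
        System.out.println(decode(encoded, "UTF-8").equals(japanese));
        // Wrong charset: every byte becomes its own (meaningless) character.
        System.out.println(decode(encoded, "ISO-8859-1").length());
    }
}
```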

Again, I seriously don't think that any servlet container does this check, so, if we get wrong request parameters when someone browses in Japanese and submits a GET form, it is not our fault...

-----------
CONCLUSION:
-----------

I believe Jon Postel once said "be strict in what you send, be liberal in what you accept" and this principle has been forgotten by the servlet-container implementors...

We can be as strict as we like in sending the right stuff (the servlet API allows us to do it by using OutputStream(s) instead of Writer), but we cannot be liberal in what we accept, as URLs and request parameters are already pre-parsed for us into nice unicode-based Java String(s).

As far as I can see (and by the "trick" you outlined)

new String(value.getBytes("8859_1"), "utf-8")

servlet containers simply ignore that there's a world out there that DOES NOT speak English, and take shortcuts to increase their parsing speed...

Unfortunately, there's not much we can do (apart from brutal hacks like the one mentioned above) to get parameters from my girlfriend's Japanese browser.
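For the record, this is why the hack works and why it is brutal. The sketch below (class and method names are mine) first simulates a container decoding UTF-8 bytes as ISO-8859-1, then undoes it. ISO-8859-1 maps every byte to exactly one character, so the original bytes can be recovered from the wrong string. But the hack only helps when the bytes really were UTF-8; on a value that was already decoded correctly it destroys the text:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class ParameterRepairSketch {

    // What the container is assumed to do: treat UTF-8 bytes as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // The hack: get the raw bytes back and decode them as UTF-8.
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // three kanji
        String broken = misdecode(japanese);
        System.out.println(broken.length());                  // one char per UTF-8 byte
        System.out.println(repair(broken).equals(japanese));  // the hack recovers the text

        // Applied to a value that was NOT mis-decoded, the hack corrupts it.
        String fine = "caf\u00e9";
        System.out.println(repair(fine).equals(fine));
    }
}
```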

One thing we could do, though, is to make sure that the communities building our servlet container of choice are aware of those problems, so, rather than reinventing hacks in Cocoon, I'd say, post those issues as bugs for Tomcat and Jetty and let them sort out the whole mess...

It ain't our fault, and unfortunately, we can properly fix only one side of the story: what we send...

Pier
