On 17 Aug 2004, at 16:20, Marc Portier wrote:

How about setting it up as the default behavior for Cocoon's internal Jetty distro?

makes sense, but: (wishing all this brokenness wasn't there, but alas)

It's not really "brokenness" but more along the lines of an inversion of the Robustness Principle, as outlined by J. Postel in RFC-791 (http://www.rfc-editor.org/rfc/rfc791.txt section 3.2) and later dogmatized by R. Braden in RFC-1122 (http://www.rfc-editor.org/rfc/rfc1122.txt Section 1.2.2).

"Be liberal in what you accept, and conservative in what you send."

In this case browsers are liberal in what they send (URL-Encoded UTF-8) and servlet containers are conservative in what they accept (URL-Encoded ISO-8859-1).
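To make the mismatch concrete, here is a minimal Java sketch. The class and method names are made up for illustration, and it only uses java.net.URLEncoder and URLDecoder to stand in for the browser and the container; it is not any container's actual decoding code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class, not part of Cocoon or Jetty.
class EncodingMismatchDemo {

    // Stand-in for the browser: percent-encode the UTF-8 bytes of the text.
    static String browserEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Stand-in for the container: percent-decode with the charset it assumes.
    static String containerDecode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String original = "caf\u00e9";                 // "cafe" with e-acute
        String onTheWire = browserEncode(original);    // caf%C3%A9
        String asLatin1 = containerDecode(onTheWire, "ISO-8859-1");
        String asUtf8 = containerDecode(onTheWire, "UTF-8");
        System.out.println(onTheWire);
        System.out.println("ISO-8859-1 round trip intact: " + original.equals(asLatin1));
        System.out.println("UTF-8 round trip intact: " + original.equals(asUtf8));
    }
}
```

Decoding with ISO-8859-1 turns the two UTF-8 bytes of the accented letter into two separate characters, which is the garbage people end up seeing in their request URIs.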

- it shouldn't keep us from actually getting on with solving it for all
containers? (my guess is that just a fraction of cocoon deployments
actually run on the internal jetty distro, i.e. using the cocoon.sh or
.bat?)

Well, we found that Jetty in production was much better than anyone else. So, in our production environment we have Jetty (not the Cocoon distro one, a full blown copy)... Works pretty neatly! :-P

- learning about this org.mortbay.util.URI.charset property we should
probably use it to override (or at least log-warn deployers if it's
different to) the container-encoding setting in the web.xml
(assuming that the mentioned property will also be in effect when
decoding the request parameters, and taking into account that current
cocoon code assumes ISO-8859-1 as the default there)

I agree, but as I said, my world revolves around the best container in the world (whoops, Jetty), so I already have "my" fix to the problem: switch! :-P
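For what it's worth, the log-warn idea could be sketched roughly like this. Everything here is an assumption for illustration: the class and method names are hypothetical, the property name is simply the one mentioned above, I have not checked when or how Jetty reads it, and the ISO-8859-1 fallback is only the default that current Cocoon code is said to assume for container-encoding.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not existing Cocoon code.
class CharsetConsistencyCheck {

    // Property name as mentioned in this thread.
    static final String URI_CHARSET_PROPERTY = "org.mortbay.util.URI.charset";

    // Default assumed for container-encoding, per the discussion above.
    static final String DEFAULT_CONTAINER_ENCODING = "ISO-8859-1";

    // Returns a warning message, or null when there is nothing to report.
    static String mismatchWarning(String uriCharset, String containerEncoding) {
        if (uriCharset == null) {
            return null; // property not set, nothing to compare against
        }
        String container = containerEncoding == null
                ? DEFAULT_CONTAINER_ENCODING
                : containerEncoding;
        // Compare Charset objects so that "utf-8" and "UTF-8" count as equal.
        if (Charset.forName(uriCharset).equals(Charset.forName(container))) {
            return null;
        }
        return URI_CHARSET_PROPERTY + "=" + uriCharset
                + " differs from container-encoding=" + container;
    }

    public static void main(String[] args) {
        String warning = mismatchWarning(System.getProperty(URI_CHARSET_PROPERTY), null);
        if (warning != null) {
            System.err.println("WARN: " + warning);
        }
    }
}
```

Whether the property should then override the web.xml value, or merely trigger the warning, is a policy choice this sketch leaves open.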

- once we've run that far, we might even consider making a scan of other
servlet containers and how they possibly allow setting the
container-encoding?

The "container-encoding" servlet initialization parameter applies only to request parameters (form data), and I suppose it only affects the way in which we read full-blown characters from ServletRequest.getInputStream() and parse forms.

while typing I started rethinking why we ended up with this
container-encoding init-param in web.xml?

IIRC we did that because of required compliance to servlet spec versions
prior to 2.3? So the first question is: are we still on servlet 2.2?

If not: Since 2.3 there exists a setCharacterEncoding()
<quote from="servlet 2.3 javadoc"
href="http://java.sun.com/products/servlet/2.3/javadoc/javax/servlet/ServletRequest.html#setCharacterEncoding(java.lang.String)">
Overrides the name of the character encoding used in the body of this
request. This method must be called prior to reading request
parameters or reading input using getReader().
</quote>

Indeed, the problem here is that it's nowhere specified how the request BODY (not the URL, source of this problem) should be encoded.

Normally, from browser behaviour, I can see that browsers tend to post application/x-www-form-urlencoded data in the same charset they used when interpreting the form. So given an HTTP exchange like this:

C: GET /myForm HTTP/1.1
C: Host: localhost:80
C:
S: HTTP/1.1 200 OK
S: Date: Wed, 18 Aug 2004 08:30:28 GMT
S: Server: Apache/2.0.49 (Unix) DAV/2 SVN/1.0.2
S: Content-Type: text/html; charset=utf-8

When the form included in /myForm is posted back to its action, the UTF-8 charset will be used to encode the form data...
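As a rough illustration of what the receiving side then has to do, here is a small stand-alone sketch of parsing such a form body with an explicitly chosen charset. The class and method names are hypothetical, and this is not how any particular container implements it; it only shows why the charset used for parsing has to match the one the page was served in.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for illustration only.
class FormBodyParser {

    // Splits a URL-encoded form body into name/value pairs and decodes
    // each part with the given charset. Later duplicates overwrite earlier ones.
    static Map<String, String> parse(String body, String charset) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        if (body.length() == 0) {
            return params;
        }
        try {
            for (String pair : body.split("&")) {
                int eq = pair.indexOf('=');
                String name = eq < 0 ? pair : pair.substring(0, eq);
                String value = eq < 0 ? "" : pair.substring(eq + 1);
                params.put(URLDecoder.decode(name, charset),
                           URLDecoder.decode(value, charset));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
        return params;
    }

    public static void main(String[] args) {
        // Body as a browser would post it from a page served as UTF-8.
        String body = "city=K%C3%B6ln&q=a+b";
        String expected = "K\u00f6ln";
        System.out.println(expected.equals(parse(body, "UTF-8").get("city")));      // true
        System.out.println(expected.equals(parse(body, "ISO-8859-1").get("city"))); // false
    }
}
```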

That's normally a rule of thumb, and that's why (IMVHO) UTF-8 should be used for all forms, and should always be used as the default encoding for writing and reading.

- I assume the cocoon servlet could easily arrange for calling the
method before anything else

Yes, hoping that it actually works. But cocoon should call the method with the encoding that was used to send out the form the data is being read from... That should be easy for continuations, but in most cases I'd say it's a good principle to choose one encoding for your entire application and stick to it...

- I'm a bit unsure here if the javadoc's mention of 'in the body of
this request' is going to be interpreted by implementations as a
limiting scope, and if so if they include the URI (and the request
params using get vs post) as part of it or not

The point you mentioned in the spec _DOES_NOT_ include the request URI. We've talked quite extensively over it while writing Servlet 2.4, which (in theory) should expand more on the concepts of charset and i18n.

(talk about possible confusion when writing specs like this, yuk!)

Well, it's a big gray area... Most of my knowledge is based on my girlfriend's PC. She's Japanese, and although I don't understand what's all that gibberish on her screen, I can still test out a few bits and bobs...

For all our MacOS/X folks, if you want to try out playing with different encodings and internationalization settings, close your Safari, Mozilla, Firefox, and so on, go into the System Preferences and drag the "bookcase, christmas tree, lotsa-lines block" (ni-hon-go) sequence of three characters right up to the top. Start your browser, and then restore English (French, Italian, German) up on top where it was in the preferences.

Your browser will now think it's working on a Japanese PC and will do everything like you were living in Tokyo.

On Windows, sorry, your best bet is to actually GO to Tokyo, and buy a copy of WindowsXP in Japanese. :-(

Pier
