Re: [Help]How can I use non-ascii file name?

Marc Portier Thu, 19 Aug 2004 01:16:07 -0700

Pier Fumagalli wrote:

On 17 Aug 2004, at 16:20, Marc Portier wrote:
How about setting it up as the default behavior for Cocoon's internal Jetty distro?
makes sense, but: (wishing all this brokenness wasn't there, but alas)
It's not really "brokenness" but more along the lines of an inversion of the Robustness Principle, as outlined by J. Postel in RFC-791 (http://www.rfc-editor.org/rfc/rfc791.txt section 3.2) and later dogmatized by R. Braden in RFC-1122 (http://www.rfc-editor.org/rfc/rfc1122.txt Section 1.2.2).
"Be liberal in what you accept, and conservative in what you send."
In this case browsers are liberal in what they send (URL-Encoded UTF-8) and servlet containers are conservative in what they accept (URL-Encoded ISO-8859-1).
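
To make the mismatch concrete, here is a minimal sketch using only the JDK. The class name UrlDecodeMismatch and the sample file name are made up for illustration, and this is not what any particular container does internally; it only shows what happens when the same percent-encoded bytes are decoded with two different charsets.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: decodes the same percent-encoded text with a chosen charset.
class UrlDecodeMismatch {
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            // Wrap the checked exception so callers can stay simple.
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String sent = "%E2%82%AC.xml"; // euro sign as URL-encoded UTF-8, as a browser would send it
        String asUtf8 = decode(sent, "UTF-8");        // one euro character plus ".xml"
        String asLatin1 = decode(sent, "ISO-8859-1"); // three unrelated characters plus ".xml"
        System.out.println(asUtf8.length() + " vs " + asLatin1.length());
    }
}
```

Decoded as UTF-8 the three bytes collapse into the single euro character; decoded as ISO-8859-1 each byte becomes its own character, so a lookup for the file name on disk fails.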


indeed

- it shouldn't keep us from actually going about solving it for all
containers? (my guess is that just a fraction of cocoon deployments
actually run on the internal jetty distro, i.e. using the cocoon.sh or
.bat?)
Well, we found that Jetty in production was much better than anyone else. So, in our production environment we have Jetty (not the Cocoon distro one, a full blown copy)... Works pretty neatly! :-P
- learning about this org.mortbay.util.URI.charset property we should
probably use it to override (or at least log-warn deployers if it's
different to) the container-encoding setting in the web.xml
(assuming that the mentioned property will also be in effect when
decoding the request parameters, and taking into account that current
cocoon code assumes ISO-8859-1 as the default there)
I agree, but as I said, my world revolves around the best container in the world (whoops, Jetty), so I already have "my" fix to the problem: switch! :-P
- once we've run that far, we might even consider making a scan of other
servlet containers and how they possibly allow setting the
container-encoding?
The "container-encoding" servlet initialization parameter simply applies to request parameters (form data), and I suppose it only affects the way in which we read full-blown characters from ServletRequest.getInputStream() and parse forms.

I'd need to check, but I assume the request params are included regardless of the GET or POST method

of course the uri-part before the ? would already need to have been used internally in the servlet container, at least to point to the correct JSP or servlet...

hm, I'd need to try out some jsp/servlet with a euro-sign in the file-name or so and check whether the path indication in the web.xml is able to find it...

while typing I started rethinking why we ended up with this
container-encoding init-param in web.xml?

IIRC we did that because of required compliance to servlet spec versions
prior to 2.3?  So first question is are we still on servlet 2.2?


Just found the thread that answers the question:
http://marc.theaimsgroup.com/?l=xml-cocoon-dev&m=108858029423811&w=2

If not: Since 2.3 there exists a setCharacterEncoding() <quote from="servlet 2.3 javadoc" href="http://java.sun.com/products/servlet/2.3/javadoc/javax/servlet/ServletRequest.html#setCharacterEncoding(java.lang.String)"> Overrides the name of the character encoding used in the body of this request. This method must be called prior to reading request parameters or reading input using getReader(). </quote>
Indeed, the problem here is that it's nowhere specified how the request BODY (not the URL, source of this problem) should be encoded.

yep, but as stated above: I suppose that the border-case 'request-params in GET mode' is included (even if those are, strictly speaking, not in the body?).

This seems to suggest that the current use of the en-re-decoding trick in cocoon's request-wrapper could be cleaned out (since we voted to go with 2.3 from now on)
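
For readers who haven't met the trick: a string that was decoded with the wrong charset can be turned back into its original bytes and decoded again. The sketch below shows only the general idea with plain JDK calls; the class and method names are made up, and it is not the actual code of cocoon's request-wrapper.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the "en-re-decoding" idea.
class ReDecode {
    // Re-encode with the charset the container wrongly assumed (ISO-8859-1),
    // which recovers the raw bytes, then decode them with the charset
    // the client really used (UTF-8).
    static String reDecode(String wronglyDecoded) {
        byte[] raw = wronglyDecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }
}
```

This round trip only works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost in the first (wrong) decoding step. With a container default that is not a one-to-one byte mapping the original bytes may not be recoverable.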

Normally, from browser behaviour, I can see that browsers tend to post application/x-www-form-urlencoded data in the same charset they used when interpreting the form. So given an HTTP request like this:
C: GET /myForm HTTP/1.1
C: Host: localhost:80
C:
S: HTTP/1.1 200 OK
S: Date: Wed, 18 Aug 2004 08:30:28 GMT
S: Server: Apache/2.0.49 (Unix) DAV/2 SVN/1.0.2
S: Content-Type: text/html; charset=utf-8
When the form included in /myForm is posted back to its action, the UTF-8 charset will be used to encode the form data...

That's normally a rule of thumb, and that's why (IMVHO) UTF-8 should be used for all forms, and should always be used as the default encoding for writing and reading.
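
As a rough illustration of why the page charset matters, the sketch below encodes one form field the way a browser would for two different page charsets. It uses only java.net.URLEncoder; the class name FormEncoding and the field name and value are invented for the example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a single name=value pair of a form body.
class FormEncoding {
    static String encodeField(String name, String value, String pageCharset) {
        try {
            return URLEncoder.encode(name, pageCharset) + "=" + URLEncoder.encode(value, pageCharset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + pageCharset, e);
        }
    }

    public static void main(String[] args) {
        String city = "Z\u00FCrich"; // contains u-umlaut
        System.out.println(encodeField("city", city, "UTF-8"));      // city=Z%C3%BCrich
        System.out.println(encodeField("city", city, "ISO-8859-1")); // city=Z%FCrich
    }
}
```

The two bodies differ, and nothing in the posted data itself says which charset was used, so the receiving side has to know (or assume) the charset the form page was served in.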


yep,
we have wiki info already indicating that to our users:
http://wiki.apache.org/cocoon/RequestParameterEncoding

(hm, more interesting stuff out there, and probably some of the new viewpoints from this thread could be added there)

- I assume the cocoon servlet could easily arrange for calling the
method before anything else
Yes, hoping that it actually works. But cocoon should call the method with the encoding used to send the form from where data is read...


yep, they should be consistent.
fact is there was a patch on the serializers to do so by default

(but the other way around: by default they are taking the setting of form_encoding init param for doing the serialization)

fixcommit here:
http://cvs.apache.org/viewcvs.cgi/cocoon/trunk/src/java/org/apache/cocoon/serialization/AbstractTextSerializer.java?r1=24666&r2=26246&p1=cocoon/trunk/src/java/org/apache/cocoon/serialization/AbstractTextSerializer.java&p2=cocoon/trunk/src/java/org/apache/cocoon/serialization/AbstractTextSerializer.java&diff_format=h&root=Apache-SVN

archived discussion here: http://marc.theaimsgroup.com/?t=106760662600010&r=1&w=2

should be easy for continuations, but in most of the cases, I'd say that it's a good principle to choose one encoding for your entire application and stick to it...

agree, just running through the (above mentioned) wiki page however I noticed a paragraph on wanting to 'locally' override the form-encoding for certain pipelines (use case being support for clients other than the classic browsers, which might behave differently)

the suggested setCharacterEncodingAction seems to be a good match to that issue and it somewhat suggests we should keep some form of possible en-re-decoding scheme in our request-wrapper (looks like the 2.3 switch should not make us jump to hasty conclusions on that part)

(boy this issue seems to be a rose with many thorns, and it seems to blossom every year or so :-))

- I'm a bit unsure here if the javadoc mentioning of 'in the body of
this request' is going to be interpreted by implementations as a
limiting scope, and if so if they include the URI (and the request
params using get vs post) as part of it or not
The point you mentioned in the spec _DOES_NOT_ include the request URI. We've talked quite extensively over it while writing Servlet 2.4, which (in theory) should expand more on the concepts of charset and i18n.


thx for the clarification and inside info

(talk about possible confusion when writing specs like this, yuk!)
Well, it's a big gray area... Most of my knowledge is based on my girlfriend's PC. She's Japanese, and although I don't understand what's all that gibberish on her screen, I can still test out a few bits and bobs...

For all our MacOS/X folks, if you want to try out playing with different encodings and internationalization settings, close your Safari, Mozilla, Firefox, and so on, go into the System Preferences and drag the "bookcase, christmas tree, lotsa-lines block" (ni-hon-go) sequence of three characters right up to the top. Start your browser, and then restore English (French, Italian, German) to the top where it was in the preferences.

Your browser will now think it's working on a Japanese PC and will do everything like you were living in Tokyo.

On Windows, sorry, your best bet is to actually GO to Tokyo, and buy a copy of WindowsXP in Japanese. :-(

yeah testing isn't obvious as one also needs to rely on having an as-unicode-complete-as-they-come font so you are sure you are seeing what you think you are seeing...

any case: my personal testing-candidate for these cases is just using the euro-sign (\u20AC, utf-8: %E2%82%AC) in pathnames, filenames, classnames, request params and whatnot.

most european systems (even windows) would have a native encoding supporting the euro sign (while iso-8859-1 obviously doesn't)
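
a small sketch to check those two claims with nothing but the JDK (the class name EuroProbe is made up, and the percent-encoder below escapes every byte, so it is a probe and not a general URL encoder):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical probe for the euro-sign test case.
class EuroProbe {
    static final String EURO = "\u20AC";

    // Percent-encodes every UTF-8 byte of the input string.
    static String percentEncodeUtf8(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            out.append(String.format("%%%02X", b & 0xFF));
        }
        return out.toString();
    }

    // True if ISO-8859-1 has a byte for every character of the input.
    static boolean latin1CanEncode(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }
}
```

percentEncodeUtf8(EuroProbe.EURO) gives %E2%82%AC, and latin1CanEncode(EuroProbe.EURO) is false, which is what makes the euro sign a handy test character: it breaks as soon as anything in the chain falls back to iso-8859-1.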

geek detail: you can even use it in your Java source code:

public class \u20ACToBEF
{
...
}

(in fact java's compiler is completely unicode aware towards the source code: if you're sick enough you might even go about writing the keywords like 'public' and 'class' in their escaped unicode variants :-) notice that you will need to be able to specify a euro sign in the filename of that source though)


regards,
-marc=
--
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
