Re: [Help]How can I use non-ascii file name?

Marc Portier Mon, 16 Aug 2004 04:06:13 -0700

Pier,

By coincidence, we had a similar post last week on the xreporter list (xReporter uses Cocoon).

The bad news is that I haven't tracked it down to the bottom yet; I only have some findings, listed below. (In fact, odd characters in file names for map:read and map:mount were among the first things I was going to test; it seems I've already been handed the results.)


What I have found so far is this:

Cocoon's Request.getSitemapURI() returns the concatenation of javax.servlet.http.HttpServletRequest.getServletPath() + javax.servlet.http.HttpServletRequest.getPathInfo()

The servlet spec states that both of these are returned (URL-)decoded. Thus each three-character sequence of the form "%XX" (two hex digits) will have been translated into a single byte. The resulting byte sequence is then decoded into characters using some encoding. My guess is ISO-8859-1, but I haven't yet found out whether this is container-specific, configurable, or hard-wired in some spec. The only thing I found is this: http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars, but I'm not yet sure how it influences the servlet specs, or actual container and even browser implementations for that matter.
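To make the two steps concrete (percent-decoding to bytes, then bytes to characters), here is a minimal sketch in plain Java. It is not Cocoon or container code: the class and method names are made up for illustration, and it assumes the browser sent the path as percent-encoded UTF-8, as in the access log quoted further down.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class PathDecodingDemo {

    // Hypothetical helper: percent-decode a path and interpret the resulting
    // bytes in the given charset. Note that URLDecoder follows form-encoding
    // rules and also turns '+' into a space, which real path decoding does not.
    public static String decode(String rawPath, String charsetName) {
        try {
            return URLDecoder.decode(rawPath, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Raw path as it appears in the access log of the quoted report.
        String raw = "/%e8%b0%b7%e7%90%86%e5%ad%90";

        String asLatin1 = decode(raw, "ISO-8859-1"); // one char per byte
        String asUtf8 = decode(raw, "UTF-8");        // three bytes per char here

        System.out.println(asLatin1.length()); // 10: '/' plus nine single-byte chars
        System.out.println(asUtf8.length());   // 4: '/' plus three Chinese chars
        System.out.println(asUtf8.equals("/\u8c37\u7406\u5b50")); // true
    }
}
```

If the container decodes with ISO-8859-1 while the sitemap pattern holds the real characters, the two strings can never match, which would be consistent with the behaviour reported below.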


Alternatively there is:
Cocoon's Request.getRequestURI() which maps onto the
javax.servlet.http.HttpServletRequest.getRequestURI()

This one reflects the URI as transferred over the wire, i.e. not (URL-)decoded, or in other words still holding the %XX sequences.

As an extra clarification on all of this, the servlet spec explicitly states (version 2.3, page 34, section SRV.4.4 "Request Path Elements"): <quote> It is important to note that, *except for URL encoding differences* between the request URI and the path parts, the following equation is always true:

requestURI = contextPath + servletPath + pathInfo
</quote>

For now I assume that the encoding the container uses for this decoding is the same one we expect people deploying Cocoon to specify in the 'container-encoding' init parameter in web.xml (which allows us to correctly re-decode request parameter values when the form and container encodings don't match).

Ok, above is dull data, and not much into a direction of any solution yet. My current feeling (long shot, needs time to test and try, and based on above assumption) is that we should

In terms of backwards compatibility, I'm unsure whether we could simply change the semantics of getSitemapURI() (the historically implied use of ISO-8859-1 encoding), or whether we should rather deprecate it and/or add a different method next to it.

In any case, this new implementation should then probably apply the same kind of dirty re-decoding trick

return new String(getSitemapURI().getBytes(container_encoding), form_encoding)

as we do today with the request param values?

(see http://cvs.apache.org/viewcvs.cgi/cocoon-2.1/src/java/org/apache/cocoon/environment/http/HttpRequest.java?annotate=1.11#391 sorry for the old cvs-style link, the svn version of viewcvs doesn't seem to support 'annotate' ?)

For the record: the fast hack/workaround in the xreporter case was exactly to apply this.
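A minimal, self-contained sketch of that re-decoding trick follows. The class and method names are hypothetical (this is not the HttpRequest code linked above), and it assumes the container decoded the bytes as ISO-8859-1 while the client actually sent UTF-8.

```java
import java.io.UnsupportedEncodingException;

public class SitemapUriRecoder {

    // Hypothetical helper: turn the string back into the bytes the container
    // saw, then decode those bytes with the encoding the client really used.
    public static String recode(String value, String containerEncoding, String formEncoding) {
        try {
            return new String(value.getBytes(containerEncoding), formEncoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset", e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String original = "/\u8c37\u7406\u5b50";

        // Simulate a container that decodes UTF-8 bytes as ISO-8859-1.
        String misdecoded = new String(original.getBytes("UTF-8"), "ISO-8859-1");

        String repaired = recode(misdecoded, "ISO-8859-1", "UTF-8");
        System.out.println(repaired.equals(original)); // true
    }
}
```

The round trip is only lossless when the container encoding maps every byte value to a distinct character, as Java's ISO-8859-1 does. With a container encoding that has unmapped bytes, some of the original bytes would already be lost before the trick is applied.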

Related to this, I also see trouble with mount points in Cocoon. I've seen a number of installations that need (well, at least use) the part of the URL that maps to the mounted sitemap, so that links in source XML files can refer to other resources managed by the same mounted sitemap without explicitly mentioning that part (it is dynamically inserted by some XSL instead).

In those cases I've mostly seen people subtract the sitemapURI from the requestURI to obtain that prefix. Given the above observations, however, this algorithm will fail because of the encoding differences.

My proposal would be to not only add a method that decodes the sitemapURI properly, but at the same time add a convenience method on Cocoon's request that returns the mounted-sitemap part as well.
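As a rough illustration of what such a convenience method might do, the sketch below decodes the request URI with the same encoding as the sitemap URI before stripping the suffix. The class and method names and the example paths are hypothetical, not an existing Cocoon API, and the sketch assumes the sitemap URI has already been decoded correctly.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class MountPrefix {

    // Hypothetical helper: returns the part of the request URI that precedes
    // the sitemap URI, comparing both in decoded form.
    // URLDecoder also maps '+' to a space, so this is only an approximation
    // of proper path decoding.
    public static String sitemapPrefix(String requestUri, String sitemapUri, String encoding) {
        String decoded;
        try {
            decoded = URLDecoder.decode(requestUri, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + encoding, e);
        }
        if (!decoded.endsWith(sitemapUri)) {
            throw new IllegalArgumentException("Sitemap URI is not a suffix of the request URI");
        }
        return decoded.substring(0, decoded.length() - sitemapUri.length());
    }

    public static void main(String[] args) {
        String requestUri = "/app/reports/%e8%b0%b7.xml"; // still percent-encoded
        String sitemapUri = "\u8c37.xml";                  // already decoded

        // Naive subtraction on the raw strings cannot work:
        System.out.println(requestUri.endsWith(sitemapUri)); // false

        System.out.println(sitemapPrefix(requestUri, sitemapUri, "UTF-8")); // /app/reports/
    }
}
```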

These are early observations that need some backing, so comments are welcome. (And I'm hoping someone beats me to this, since I lack the time to pursue it myself.) -marc=

Pier Fumagalli wrote:

On 12 Aug 2004, at 12:45, roy huang wrote:
Hi all,

Using a reader to display a jpg or gif is quite simple, like:
    <map:match pattern="*.jpg">
      <map:read mime-type="image/jpg" src="jpg/{1}.jpg" />
    </map:match>
But if the file name is not ASCII but UTF-8 or another encoding, like è.jpg (Simplified Chinese), the resolver doesn't resolve the name correctly and an error occurs:
org.apache.cocoon.ResourceNotFoundException: Error during resolving of the input stream: org.apache.excalibur.source.SourceNotFoundException: file:/C:/My Documents/IBM/wsad/workspace/PowerOA/WebContent/test/jpg/ÃÂÂ.jpg doesn't exist.

How can I use a non-ASCII file name in Cocoon? I can't find any description or help in the wiki or the archived mailing list.

Roy Huang
It appears indeed as a bug...
I have this sitemap snippet:
    <map:match pattern="谷*">
      <map:generate src="谷{1}.xml"/>
      <map:transform src="welcome.xslt">
        <map:parameter name="contextPath" value="{request:contextPath}"/>
      </map:transform>
      <map:serialize type="xhtml"/>
    </map:match>
and a file on the disk called "谷理子.xml". Somewhere, when I make a request for "http://localhost:8888/谷理子", the whole thing goes berserk...
Now, the URL is passed correctly, as I see that in the access log:
INFO (2004-08-16) 10:26.36:538 [access] (/%e8%b0%b7%e7%90%86%e5%ad%90) main-3/CocoonServlet: '????????' Processed by Apache Cocoon 2.1.5 in 27 milliseconds.

The above-mentioned string's encoding in UTF-8 is, in fact, "E8 B0 B7 E7 90 86 E5 AD 90", so, cocoon receives it correctly, but somehow it gets lost in the process.
Now, if I modify my sitemap to
    <map:match pattern="tanisatoko">
      <map:generate src="谷理子.xml"/>
      <map:transform src="welcome.xslt">
        <map:parameter name="contextPath" value="{request:contextPath}"/>
      </map:transform>
      <map:serialize type="xhtml"/>
    </map:match>
And I make a request to "http://localhost:8888/tanisatoko", the thing works perfectly. So we can safely rule out the generation process as the culprit.

Now, the _odd_ thing I noticed is that in those cases, I get an error of "PipelineNotFound", not a "ResourceNotFound", which means that the matcher seriously doesn't see that request.

Switching the matcher to a 'regexp' matcher doesn't change anything, so I bet it's the data we feed to the matcher.

Now, changing the matcher pattern to "è°·çå" (the UTF-8 bytes of the name read as single-byte characters) and running it again, I get my nice page correctly.

I bet that somewhere (I don't know where, but surely somewhere), the UTF-8 encoded URL is converted into a string using the current locale (MacRoman on my system), or a default of "ISO-8859-1", before the string is actually given to the sitemap.

Not having the sources at hand at the moment, I can't do a quick build to put out some debugging instruction, but you get the idea.
    Pier


--
Marc Portier                            http://outerthought.org/
Outerthought - Open Source, Java & XML Competence Support Center
Read my weblog at                http://blogs.cocoondev.org/mpo/
[EMAIL PROTECTED]                              [EMAIL PROTECTED]
