On 20/05/2010, at 18:41, Sjur Moshagen wrote:

> On 20 May 2010, at 15:26, Thorsten Scherler wrote:
>
>> On 20/05/2010, at 14:18, Sjur Moshagen wrote:
>>>> ...
>>>> Hmm, that is weird. Please try the following:
>>>> - add a new contract that uses ñ, í and similar characters
>>>> - see what comes out
>>>
>>> I added a blank contract that just printed the same line of characters I
>>> used earlier for testing, and this is what came out:
>>>
>>> This is a text containing problematic characters:
>>> a á c č d đ n ŋ s š t ŧ z ž ae æ oe ø ao å a¨ ä o¨ ö g ǥ h ħ u ʉ i ɨ
>>>
>>> That is, the text from the contract comes through just fine, but text
>>> coming from a standard Forrest v2 document gets garbled.
>>>
>>> I have attached a picture of the page as it renders. The box comes from the
>>> document, the text at the bottom is from the contract.
>>
>> Ok I see.
>>
>> Please post the dataUri you use for the contract. It seems that the utf-8 is
>> lost in this step. If you have the dataUrl of the contract see what is
>> coming out there, whether it is already scrambled or not.
>
> I'm not sure about how to do this, but I'll try. The dataUri used in the
> structurer is:
>
> <forrest:contract name="content-main"
>   dataURI="cocoon://#{$getRequest}.body.xml">    <-- this is the dataURI
>   <forrest:property name="content-main-conf">
>     <headings type="boxed"/>
>   </forrest:property>
> </forrest:contract>
>
> which I take to mean:
>
> http://localhost:8888/index.body.xml

Correct, that was the URI I needed.

> The text returned by that Uri is:
>
> <?xml version="1.0" encoding="ISO-8859-1"?><div id="content"><h1>Divvun -
> Sámi proofing tools project</h1><div id="content-main">
>
> <div class="note"><div class="label">UTF-8 character test</div><div
> class="content">
> There seems to be problems with certain characters, but only in
> Dispatcher:<br xmlns:xi="http://www.w3.org/2001/XInclude"/>
> a á c č d đ n ŋ s š t ŧ z ž ae æ
> oe ø ao å a¨ ä o¨ ö g ǥ h ħ u ʉ i ɨ
> </div></div>
>
> </div></div>
>
> Two things to note here:
>
> The encoding is specified as ISO-8859-1, which is wrong,

Yes, it should be UTF-8.

> and which leads to all characters outside Latin1 to be encoded as numeric
> entities.

Actually, the numeric form is fine, or at least it should be. In my use case
I take RSS from Roller, and the characters come through in numeric form but
with UTF-8 encoding.

> In the next step, this causes all non-ASCII, non-Latin1 characters to survive
> correctly, while the Latin1 chars will be messed up when they are
> reinterpreted as UTF-8 later - or something along these lines.

Yeah, it seems the numeric form works fine but the "native" form does not
play nice. I wonder whether changing the encoding of the returned *.body.xml
document fixes the problem.

> I don't know where the encoding comes from - everything on my end is marked
> as UTF-8. I grepped for the string "ISO-8859-1" in the Forrest sources, and
> got many hits, but nothing that seemed to relate to Dispatcher.

The *.body.xml comes from the dataModel.xmap:

<!-- HTML rendered from intermediate format -->
<map:match pattern="**.body.xml">
  <map:generate src="cocoon:/{1}.source.rewritten.xml" />
  <map:transform src="{lm:dataModel-html-document-to-html.xsl}">
    <map:parameter name="path" value="{1}.html" />
  </map:transform>
  <map:serialize />
</map:match>

The serializer here is the default one.

We define it in the xmap as

<map:serializers default="xml" />

That should read:

<map:serializers default="xml-utf8" />

I committed this in revision 946939; please see whether that fixes the issue.

I added a test note to
org.apache.forrest.plugin.internal.dispatcher/src/documentation/content/xdocs/index.xml
so you can directly run "forrest run" in the plugin and see the outcome. Once
we are done testing we should remove the debug note.

salu2

Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>
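
For reference, the failure mode suspected in this thread can be reproduced outside Forrest. The following is only a minimal Python sketch, not Forrest or Cocoon code: it assumes the serializer writes ISO-8859-1 with numeric character references for anything outside Latin-1, and that a later stage reads those bytes as UTF-8. The sample string and variable names are made up for illustration.

```python
# Sketch of the suspected failure mode; not Forrest or Cocoon code.
text = "a á c č æ ø å"

# Step 1: serialize as ISO-8859-1. Characters outside Latin-1 (here č)
# become numeric character references; Latin-1 characters become single bytes.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")

# Step 2: a later stage wrongly assumes the bytes are UTF-8.
# The ASCII-only numeric references survive, the Latin-1 bytes do not.
misread = latin1_bytes.decode("utf-8", errors="replace")

# Step 3: for comparison, use UTF-8 on both sides.
round_trip = text.encode("utf-8").decode("utf-8")

print(latin1_bytes)
print(misread)
print(round_trip)
```

If the assumption holds, this matches what was observed: characters written as numeric references come through, while "native" Latin-1 characters such as á and ø are damaged, and using UTF-8 consistently avoids the problem.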