Re: XML-Serializer encoding
christian bindeballe wrote:
> [...] That's right. My mistake. I merely deduced the encoding from some characters used inside the text of the feeds, for example &#8221;, which are clearly non-Latin-1 characters. Since both feeds have ISO-8859-1 in their response headers, it means that these feeds are either malformatted or malencoded.

Don't mix up character set and character encoding. 8221 is the decimal notation of the Unicode character U+201D. iso-8859-1 and utf-8 are just character encodings. The encoding and formatting of both your sources are OK, but iso-8859-1 is a poor choice regarding readability.

> [...] Do you mean I should register the serializer used with both the parameter charset and the element encoding, corresponding (having the same value)?

The first one goes into the HTTP response header and is needed by any browser to recognize the character encoding of the content that follows. You can run your own test by just omitting it and checking the HTTP response header of your output. The second one tells the serializer which character encoding to use for the output.

> [...] I used both xml and html (to see if there is any difference in the output, but there is none). In the Userdocs it says that you shouldn't

And after serializing, do you still have &#8221; in the output, or is it converted to the double-upper-nine quotation mark?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: XML-Serializer encoding
Edwin Kapauni wrote:
> Don't mix up character set and character encoding. 8221 is the decimal notation of the Unicode character U+201D. iso-8859-1 and utf-8 are just character encodings. The encoding and formatting of both your sources are OK, but iso-8859-1 is a poor choice regarding readability.

That choice is not mine to make, unfortunately; regarding the feeds I receive, I mean. If you suggest I should use UTF-8 as the encoding scheme, I can understand why and I will certainly try to do that.

> The first one goes into the HTTP response header and is needed by any browser to recognize the character encoding of the content that follows. You can run your own test by just omitting it and checking the HTTP response header of your output. The second one tells the serializer which character encoding to use for the output.

OK, understood, that is very useful to know.

> And after serializing, do you still have &#8221; in the output, or is it converted to the double-upper-nine quotation mark?

I figured out last night that what was causing my funny output was neither the character set nor the encoding, but the way my XSL handled the feeds. Still, this whole process of looking into the character encodings and so forth helped me a lot and also made me realize it was not the cause of my problems. Thanks for pointing out these specific details about Unicode and UTF-8, I appreciate it a lot and am very glad I learned something again :)

Again, thank you for your time and effort.

Best regards,
christian
Re: XML-Serializer encoding
christian bindeballe wrote:
> Hello Marc,
> Marc Portier wrote: <snip/>
> OK, so I believe I got something wrong. These characters that I thought to be Unicode characters are rather XML interpretations?

Regarding unicode and encodings, please read this: http://www.joelonsoftware.com/articles/Unicode.html

My shortlist:
- avoid using the word 'character' since it often leads to other interpretations than the one you intend.
- use 'glyph' or 'symbol' instead to indicate the typographic idiom people know and write down
- understand that the main job of the unicode standard is to assign so-called code-points to just about all glyphs that exist out there. These code-points are interchanged between humans in a textual format that starts with U+ and is followed by four to six hexadecimal digits
- these code-points are interchanged between computers in byte-sequences; how to map code-points to byte-sequences is regulated by the encoding
- there is more than one encoding to choose from: the most commonly known are iso-8859-1 (latin-1), cp1252, utf-16, utf-8. In other words the same code-point/glyph can be interchanged in totally different byte-sequences
- latin-1 is a single-byte encoding and doesn't have room for all glyphs in the unicode list... unicode code-points for which it doesn't have a byte cannot be written directly at all (an encoder will typically substitute a replacement such as '?' or fail)
- utf-8 is a variable-width encoding where, depending on the code-point, the encoding results in a byte sequence of one to four bytes (the original design allowed up to six, but RFC 3629 limits it to four)
- since an exchanged text-file on disk (cd/usb) or over the net is just a bunch of bytes, it is in fact (theoretically) unreadable if you don't know the applied encoding
- xml files allow you to specify the encoding of the file itself in the xml declaration (first line of the file, and thus already in a certain encoding :) there is indeed a chicken-and-egg problem there, and a possible mismatch leading to parser failures if the file-encoding doesn't match the declared one
- xml files also allow the use of so-called character-entities to communicate glyphs. Typically they are only used to communicate those glyphs that don't have a valid byte-sequence in the current encoding. These entities follow either one of these patterns: &#(codepoint-in-decimal); &#x(codepoint-in-hexadecimal);
- These entities are resolved (just like &gt; &lt; &apos; &quot; and &amp;) by your parser, in other words: in the regular XML APIs SAX or DOM you will no longer find any reference to them, they got replaced by their actual glyph-representation in the programming language of your choice (which in Java actually is utf-16)
- These entities are automatically and smartly inserted by the xalan serializers depending on the encoding you force them to

> There are often chars like &#8221; in the feeds. Since these aren't translated properly and they are not part of Latin-1 I thought they must be UTF-8, which they obviously aren't, or are they?

no. utf-8 is nowhere in sight here. these sequences are on file-level genuine valid iso-8859-1 byte-sequences that make up a glyph-sequence &#8221; which only on XML level is recognised as a 'character entity' and thus interpreted as to be replaced by one single glyph

so the question remains: what do you mean by 'not translated correctly'?

Note that a final element in this whole discussion is the font you are using: sometimes simple system-fonts don't have a valid glyph-representation available for a perfectly legal communicated code-point... so you try solving things completely at the wrong end :-(

>> $ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | grep '#'
>> are all punctuation chars that seem to be correctly applied
> see above :) you're more than probably right

thx for your confidence :-) http://www.unicode.org/charts/PDF/U2000.pdf

>> I have never used coplets, nor even looked at them (deeply sorry) but I would certainly check the way these feeds are interpreted in the first place (rather than how they are serialized). if that is bad, then nothing further on in the pipe will be able to produce decent character streams regardless of the encoding schemes you're trying out on the serializer
> This is the relevant part of my sitemap:
> <map:match pattern="live.rss">

so this url will actually look like: http://yourserver/cocoon/submap/live.rss?feed=http://whatever.de/some.rss right?

> <map:generate type="file" src="{request-param:feed}" label="content"/>

this will read the mentioned feed and parse it. since the feeds are ok regarding encoding and character entities I suspect all things would be ok

> <map:transform type="xslt" src="styles/rss2html.xsl">
>   <map:parameter name="fullscreen" value="{coplet:aspectDatas/fullScreen}"/>
> </map:transform>
> <map:serialize type="xml"/>

odd, your stylesheet claims in its name to be targeting html, yet you serialize as xml, just for debugging maybe?

> </map:match>
> So my next
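The last three points of the shortlist (the parser resolves character references, and the serializer re-inserts them depending on the output encoding) can be illustrated outside Cocoon. The sketch below uses Python's standard `xml.etree.ElementTree` rather than Xalan, so it only shows the same principle, not Cocoon's actual code path; the sample document is invented.

```python
import xml.etree.ElementTree as ET

# A Latin-1 document: e-acute is a raw Latin-1 byte, U+201D is a character reference.
source = b'<?xml version="1.0" encoding="iso-8859-1"?><title>caf\xe9 &#8221;</title>'

root = ET.fromstring(source)

# After parsing, the reference is gone: the text holds the real code points.
text = root.text

# Serializing to Latin-1 must fall back to a reference for U+201D ...
as_latin1 = ET.tostring(root, encoding="iso-8859-1")
# ... while UTF-8 can write every code point directly as bytes.
as_utf8 = ET.tostring(root, encoding="utf-8")

print(repr(text))
print(as_latin1)
print(as_utf8)
```

Seeing &#8221; in Latin-1 output is therefore expected behaviour and not a sign of a broken pipeline.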
Re: XML-Serializer encoding
christian bindeballe wrote:
> [...] I figured out last night that what was causing my funny output was neither the character set nor the encoding, but the way my XSL handled the feeds.

That's the reason why I recommended omitting the transformation step and watching what happens.

    <map:match pattern="netzpolitik">
      <map:generate src="http://www.netzpolitik.org/feed/"/>
      <map:serialize/>
    </map:match>

> [...] my problems. Thanks for pointing out these specific details about Unicode and UTF-8, I appreciate it a lot and am very glad I learned something again :) Again, thank you for your time and effort.

I suppose you speak German? Then maybe you'd like to subscribe to de.comm.infosystems.www.authoring.misc, where all aspects of (X)HTML are discussed. Also http://de.wikipedia.org/wiki/UTF-8 and http://de.wikipedia.org/wiki/Unicode are good references in German.
Re: XML-Serializer encoding
Edwin Kapauni wrote:
> Also http://de.wikipedia.org/wiki/UTF-8 and http://de.wikipedia.org/wiki/Unicode are good references in German.

cheers, I looked here: http://en.wikipedia.org/wiki/Character_encoding ;) it mentions the difficulty of distinguishing between character sets and character encodings.

cb
Re: XML-Serializer encoding
On Tue, 2006-01-17 at 12:40, christian bindeballe wrote:
> Edwin Kapauni wrote:
>> Also http://de.wikipedia.org/wiki/UTF-8 and http://de.wikipedia.org/wiki/Unicode are good references in German.
> cheers, I looked here: http://en.wikipedia.org/wiki/Character_encoding ;) it mentions the difficulty of distinguishing between character sets and character encodings.

See also the XML FAQ at http://xml.silmaril.ie/authors/characters/

///Peter
Re: XML-Serializer encoding
Edwin Kapauni wrote:
> The first one goes into the HTTP response header and is needed by any browser to recognize the character encoding of the content that follows. You can run your own test by just omitting it and checking the HTTP response header of your output. The second one tells the serializer which character encoding to use for the output.

Hello again!

As I run cocoon inside a Tomcat 5.0.28 installed locally on my machine, I am afraid I am missing something here. Apparently the container encoding of my tomcat needs to be ISO-8859-1 and that cannot be changed.

[quote] Since the servlet specification requires that the ISO-8859-1 encoding is used (by default), you should never change this value unless you have a buggy servlet container. [/quote]

So I cannot change the way Tomcat encodes characters, do I get this right? Also, but this may be caused by the local installation, the response headers don't list any encoding :( nor any charset. This is what I get:

    Response Headers - http://localhost:8080/copo/portal/portal

    X-Cocoon-Version: 2.1.8
    Set-Cookie: JSESSIONID=723AA70AA8294D5DC83E78C1BD490B3C; Path=/copo
    Cache-Control: no-cache, no-store
    Pragma: no-cache
    Expires: Thu, 01 Jan 2000 00:00:00 GMT
    Content-Type: text/html
    Content-Length: 6238
    Date: Tue, 17 Jan 2006 14:21:44 GMT
    Server: Apache-Coyote/1.1

    200 OK

I went through all the sitemaps along the line and added the parameter charset=UTF-8 to the HTML serializer of the base sitemap, and also to the HTML-include serializer, to no avail. I don't think this part of the response header is being dropped on the way; I believe it simply isn't set. I searched for hints as to where I could change that, but apart from some API docs I didn't find anything (useful at all).

So, if anybody has an idea how I could make this happen, I'd be more than grateful for a hint or a solution.

Regards,
christian
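Whether a response declares a charset can also be checked mechanically instead of by eye. The following is a minimal Python sketch (the helper name `declared_charset` is invented for illustration) that parses a Content-Type header value with the standard library and reports the charset parameter, if any.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None if absent.
    return msg.get_content_charset()


# The header from the portal response above carries no charset at all.
without_charset = declared_charset("text/html")
# This is the form a browser needs in order to decode the body reliably.
with_charset = declared_charset("text/html; charset=UTF-8")

print(without_charset, with_charset)
```

For the headers shown above the result is None, which matches the observation that the charset is not being set rather than being lost.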
Re: XML-Serializer encoding
christian bindeballe wrote:
> [...] [quote] Since the servlet specification requires that the ISO-8859-1 encoding is used (by default), you should never change this value unless you have a buggy servlet container. [/quote]

Citation without sources? Where did you get that nonsense from?
Re: XML-Serializer encoding
Edwin Kapauni wrote:
> christian bindeballe wrote:
>> [...] [quote] Since the servlet specification requires that the ISO-8859-1 encoding is used (by default), you should never change this value unless you have a buggy servlet container. [/quote]
> Citation without sources? Where did you get that nonsense from?

I thought I didn't need to mention it, since I think I posted it once already: it is in the web.xml in the WEB-INF directory of the cocoon build I use.

also, Marc Portier wrote (see this thread, message-ID [EMAIL PROTECTED]):
> never change your container-encoding unless you have a servlet container of which you can specify the used encoding applied in decoding of url's and request parameters (if you don't understand what I just said: that translates to simply never)

confused
christian
Re: XML-Serializer encoding
christian bindeballe wrote:
> [...] also, Marc Portier wrote (see this thread, message-ID [EMAIL PROTECTED]):
>> never change your container-encoding unless you have a servlet container of which you can specify the used encoding applied in decoding of url's and request parameters [...]

Hi Christian,

what I've been writing about is not the container-encoding. I was writing about the character encoding of the documents and about the HTTP response header. And to add even more confusion, there are also form-encoding, url-encoding, document-encoding, transmission encoding, ...

As you are going to supply something for web browsers through HTTP, the browsers will need something like

    Content-Type: text/html;charset=utf-8

in the HTTP response header. And this is given by the *second* line in the following serializer configuration:

    <map:serializer name="xhtml"
        mime-type="test/html; charset=utf-8"
        logger="sitemap.serializer.xhtml"
        pool-grow="2" pool-max="64" pool-min="2"
        src="org.apache.cocoon.components.serializers.XHTMLSerializer">
      <encoding>UTF-8</encoding>
      <indent>no</indent>
    </map:serializer>

The best way to test is with a very minimalistic sample application with just this serializer configuration and a short pipeline, containing only

    <map:sitemap>
      <map:components>
        <map:serializers default="xml">
          <map:serializer name="xhtml"
              mime-type="test/html; charset=utf-8"
              logger="sitemap.serializer.xhtml"
              pool-grow="2" pool-max="64" pool-min="2"
              src="org.apache.cocoon.components.serializers.XHTMLSerializer">
            <encoding>UTF-8</encoding>
            <indent>no</indent>
          </map:serializer>
        </map:serializers>
      </map:components>
      <map:pipelines>
        <map:pipeline>
          <map:match pattern="netzpolitik">
            <map:generate src="http://www.netzpolitik.org/feed/"/>
            <map:serialize/>
          </map:match>
        </map:pipeline>
      </map:pipelines>
    </map:sitemap>

Try this sample, play with mime-type and encoding, and watch your output.
Re: XML-Serializer encoding
Edwin Kapauni wrote:
>     <map:serializer name="xhtml"
>         mime-type="test/html; charset=utf-8"
>         logger="sitemap.serializer.xhtml"
>         pool-grow="2" pool-max="64" pool-min="2"
>         src="org.apache.cocoon.components.serializers.XHTMLSerializer">
>       <encoding>UTF-8</encoding>
>       <indent>no</indent>
>     </map:serializer>
>
> The best way to test is with a very minimalistic sample application with just this serializer configuration and a short pipeline, containing only
>
>     <map:sitemap>
>       <map:components>
>         <map:serializers default="xml">
>           <map:serializer name="xhtml"
>               mime-type="test/html; charset=utf-8"
>               logger="sitemap.serializer.xhtml"
>               pool-grow="2" pool-max="64" pool-min="2"
>               src="org.apache.cocoon.components.serializers.XHTMLSerializer">
>             <encoding>UTF-8</encoding>
>             <indent>no</indent>
>           </map:serializer>
>         </map:serializers>
>       </map:components>
>       <map:pipelines>
>         <map:pipeline>
>           <map:match pattern="netzpolitik">
>             <map:generate src="http://www.netzpolitik.org/feed/"/>
>             <map:serialize/>
>           </map:match>
>         </map:pipeline>
>       </map:pipelines>
>     </map:sitemap>
>
> Try this sample, play with mime-type and encoding, and watch your output.

ok, the output is fine. when I saw the two little tricks you put in the code snippet I figured out how it was supposed to work. so, used as in this snippet, the charset-thingy works fine in the response headers. now I only need to find the final serializer used for the portal and add the charset setting there.

thanks a lot, edwin :)

christian
Re: XML-Serializer encoding
As I thought, it was the html-include serializer in {base}/portal/sitemap.xmap that needed some fitting :)

cb
RE: XML-Serializer encoding
I think you should have no problem at all when you just serialize everything as utf-8:

    <map:serializer logger="sitemap.serializer.xml"
        mime-type="text/xml" name="xml"
        pool-grow="4" pool-max="32" pool-min="4"
        src="org.apache.cocoon.serialization.XMLSerializer">
      <encoding>UTF-8</encoding>
    </map:serializer>

AS

> Hi,
>
> I have several newsfeeds that I want to incorporate in my portal; each one of these feeds has its own coplet. but these feeds are encoded differently: some are in ISO-8859-1, others in UTF-8. There is no way that I can change the legacy encoding of these. unfortunately it seems that even though I set the encoding of the xml-serializers (in the corresponding pipeline) that I use for those feeds to whatever, the UTF-8 feeds are not displayed properly. is there a way that I can change the encoding in cocoon so the feeds that arrive in encoding a can be changed to encoding b? I wouldn't mind having them all in UTF-8...
>
> any help would be very much appreciated.
>
> best regards,
> christian
Re: XML-Serializer encoding
Thank you, Ard

I already did that. But it doesn't change anything. I found this in my web.xml in the WEB-INF folder of my cocoon-build:

    <!--
      Set encoding used by the container. If not set the ISO-8859-1 encoding
      will be assumed. Since the servlet specification requires that the
      ISO-8859-1 encoding is used (by default), you should never change this
      value unless you have a buggy servlet container.
    -->
    <init-param>
      <param-name>container-encoding</param-name>
      <param-value>ISO-8859-1</param-value>
    </init-param>

The servlet container used is Tomcat 5.0.28.

I switched the encoding parameter to UTF-8 to check whether it would work, and it seems to. But still the coplets aren't encoded properly. Then I saw that the whole page is encoded in ISO-8859-1, having been serialized as HTML (as seen in the doctype of the page). So I looked for the HTML serializer in my portal/sitemap.xmap and changed the encoding of the html-serializer, too. no difference.

these are the feed addresses that I want to incorporate. both don't have an encoding set (do RSS feeds have to have that?) but they clearly contain UTF-8 encoded characters.

http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
http://www.netzpolitik.org/feed/

so, I guess that somewhere along the line from generating to serializing, these feeds are messed with in a way that the encoding set in the serializers has no effect whatsoever. suggestions as to where this could be, anyone? it would be greatly appreciated :)

regards,
christian

2006/1/16, Ard Schrijvers [EMAIL PROTECTED]:
> I think you should have no problem at all when you just serialize everything as utf-8:
>
>     <map:serializer logger="sitemap.serializer.xml"
>         mime-type="text/xml" name="xml"
>         pool-grow="4" pool-max="32" pool-min="4"
>         src="org.apache.cocoon.serialization.XMLSerializer">
>       <encoding>UTF-8</encoding>
>     </map:serializer>
>
> AS
Re: XML-Serializer encoding
christian b wrote:
> [...] these are the feed addresses that I want to incorporate. both don't have an encoding set (do RSS feeds have to have that?) but they clearly contain UTF-8 encoded characters.
> http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
> http://www.netzpolitik.org/feed/

Hi Christian,

Have a look at the HTTP response headers[1] of those feeds. The netzpolitik feed's header clearly states it's iso-8859-1.

Recoding is done automagically when xml is generated from that source. When you try the following snippet in your pipeline, your output will have a parsing error[2], but its source code will strictly follow the encoding settings of your serializer:

    <map:match pattern="netzpolitik">
      <map:generate src="http://www.netzpolitik.org/feed/"/>
      <map:serialize/>
    </map:match>

In the case of http://www.netzpolitik.org/feed/ you go in with iso-8859-1 and come out with utf-8 (if you didn't change the settings of your xml-serializer).

You will also have to make sure that the character encoding of your output, <encoding>UTF-8</encoding>, agrees with the encoding information sent by your serializer in the HTTP response header, e.g. mime-type="application/xhtml+xml; charset=utf-8". The following is an example xhtml serializer config containing both pieces of information:

    <map:serializer name="xhtml"
        mime-type="application/xhtml+xml; charset=utf-8"
        logger="sitemap.serializer.xhtml"
        pool-grow="2" pool-max="64" pool-min="2"
        src="org.apache.cocoon.components.serializers.XHTMLSerializer">
      <encoding>UTF-8</encoding>
      <indent>no</indent>
    </map:serializer>

What generator have you been using for your work? Maybe I didn't fully understand your problem ...

[1] http://livehttpheaders.mozdev.org/ for Firefox/Mozilla users
[2] XML Parsing Error: not well-formed
    Location: http://bodo:8080/netzpolitik
    Line Number 27, Column 17:
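What "go in with iso-8859-1 and come out with utf-8" means at the byte level, and why the declared charset has to agree with the bytes actually sent, can be sketched in Python. This is an illustration only, not Cocoon code; the function name `recode` and the sample word are made up.

```python
def recode(data, source_encoding, target_encoding):
    """Decode bytes to code points, then encode them again in the target encoding."""
    return data.decode(source_encoding).encode(target_encoding)


# "Gruesse" with u-umlaut and sharp s, as it would arrive from a Latin-1 feed.
latin1_in = "Gr\u00fc\u00dfe".encode("iso-8859-1")

# Generator plus serializer effectively do this: same code points, different bytes.
utf8_out = recode(latin1_in, "iso-8859-1", "utf-8")

# If the header claims iso-8859-1 while the body is UTF-8, a client decodes it wrongly.
garbled = utf8_out.decode("iso-8859-1")

print(latin1_in, utf8_out, repr(garbled))
```

The garbled result is the typical symptom of a mismatch between the serializer's encoding element and the charset in the mime-type attribute.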
Re: XML-Serializer encoding
christian b wrote:
> Thank you, Ard
>
> I already did that. But it doesn't change anything. I found this in my web.xml in the WEB-INF folder of my cocoon-build:
>
>     <!--
>       Set encoding used by the container. If not set the ISO-8859-1 encoding
>       will be assumed. Since the servlet specification requires that the
>       ISO-8859-1 encoding is used (by default), you should never change this
>       value unless you have a buggy servlet container.
>     -->
>     <init-param>
>       <param-name>container-encoding</param-name>
>       <param-value>ISO-8859-1</param-value>
>     </init-param>
>
> The servlet container used is Tomcat 5.0.28.
>
> I switched the encoding parameter to UTF-8 to check whether it would work, and it seems to. But still the coplets aren't encoded properly.

never change your container-encoding unless you have a servlet container of which you can specify the used encoding applied in decoding of url's and request parameters (if you don't understand what I just said: that translates to simply never)

e.g. when you use jetty (the only one I know) you can specify a system property -Dorg.mortbay.util.URI.charset=utf-8; only then should the cocoon servlet init param be changed to match that

> Then I saw that the whole page is encoded in ISO-8859-1, having been serialized as HTML (as seen in the doctype of the page). So I looked for the HTML serializer in my portal/sitemap.xmap and changed the encoding of the html-serializer, too. no difference.
>
> these are the feed addresses that I want to incorporate. both don't have an encoding set (do RSS feeds have to have that?) but they clearly contain UTF-8 encoded characters.

like where? I just did a rough scan but couldn't find any 'multiple bytes for a single character' occurrences

note that many 'at first glance odd' characters DO have a valid position in the ISO-8859-1 encoding, e.g. U+00DF, the typical german LATIN SMALL LETTER SHARP S = Eszett, is just encoded as the single byte hex DF in latin 1

the fact that a certain character requires 2 bytes in UTF-8 encoding does not mean that this character _IS_ a UTF-8 encoded char; the same character might very well have a valid and useful single-byte latin 1 encoding. (in other words: the 'encoding' is never a property of the glyph, but I admit: yeah, some glyphs don't have representations in all encodings)

> http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/

the http header states that the file is iso-8859-1 encoded (see the content-type header):

    $ wget -S --spider http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
    --15:33:26--  http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
               => `index.html'
    Resolving www.industrial-technology-and-witchcraft.de... 212.227.64.59
    Connecting to www.industrial-technology-and-witchcraft.de|212.227.64.59|:80... connected.
    HTTP request sent, awaiting response...
      HTTP/1.0 200 OK
      Date: Mon, 16 Jan 2006 14:33:26 GMT
      Server: Apache/1.3.33 (Unix)
      Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
      Expires: Mon, 16 Jan 2006 13:40:42 GMT
      Pragma: no-cache
      X-Powered-By: PHP/4.4.1
      Set-Cookie: exp_last_visit=822058407; expires=Tue, 16 Jan 2007 14:33:27 GMT; path=/
      Set-Cookie: exp_last_activity=1137418407; expires=Tue, 16 Jan 2007 14:33:27 GMT; path=/
      Set-Cookie: exp_tracker=a%3A1%3A%7Bi%3A0%3Bs%3A15%3A%22%2FITW%2Fitw-rss20%2F%22%3B%7D; path=/
      Last-Modified: Mon, 16 Jan 2006 12:40:42 GMT
      Content-Type: text/xml; charset=iso-8859-1;
      X-Cache: MISS from proxy2
      X-Cache-Lookup: MISS from proxy2:8080
      Connection: keep-alive
    Length: unspecified [text/xml]
    200 OK

and going with that, the feed's xml declaration is nicely claiming the same:

    $ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | head -1
    <?xml version="1.0" encoding="iso-8859-1"?>

at first glance it also looks like a valid claim, with special characters nicely encoded as XML entities. the ones I found with:

    $ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | grep '#'

are all punctuation chars that seem to be correctly applied

> http://www.netzpolitik.org/feed/

this one also has ISO_8859_1 encoding according to http header and xml declaration

so both seem ok

> so, I guess that somewhere along the line from generating to serializing, these feeds are messed with in a way that the encoding set in the serializers has no effect whatsoever. suggestions as to where this could be, anyone?

I have never used coplets, nor even looked at them (deeply sorry) but I would certainly check the way these feeds are interpreted in the first place (rather than how they are serialized). if that is bad, then nothing further on in the pipe will be able to produce decent character streams regardless of the encoding schemes you're trying out on the serializer

so, what do you do
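The "rough scan" for multi-byte sequences and the Eszett example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names `non_ascii_bytes` and `is_valid_utf8` are invented here, and the byte strings are small samples, not the real feed contents.

```python
def non_ascii_bytes(data):
    """Return the sorted distinct byte values >= 0x80 that occur in the data."""
    return sorted({b for b in data if b >= 0x80})


def is_valid_utf8(data):
    """Report whether the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


sharp_s = "\u00df"                      # LATIN SMALL LETTER SHARP S
latin1 = sharp_s.encode("iso-8859-1")   # one byte, hex DF
utf8 = sharp_s.encode("utf-8")          # two bytes, hex C3 9F

# A character reference is spelled with ASCII only, whatever the file encoding is.
reference = b"&#8221;"

print(latin1, utf8, non_ascii_bytes(reference))
print(is_valid_utf8(latin1), is_valid_utf8(utf8))
```

A lone DF byte is not a complete UTF-8 sequence, so a Latin-1 feed containing an Eszett would fail a strict UTF-8 decode, while a feed that uses only references such as &#8221; contains no high bytes at all.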
Re: XML-Serializer encoding
Edwin Kapauni wrote:
> Hi Christian,
>
> Have a look at the HTTP response headers[1] of those feeds. The netzpolitik feed's header clearly states it's iso-8859-1.

That's right. My mistake. I merely deduced the encoding from some characters used inside the text of the feeds, for example &#8221;, which are clearly non-Latin-1 characters. Since both feeds have ISO-8859-1 in their response headers, it means that these feeds are either malformatted or malencoded.

> Recoding is done automagically when xml is generated from that source. When you try the following snippet in your pipeline, your output will have a parsing error[2], but its source code will strictly follow the encoding settings of your serializer:
>
>     <map:match pattern="netzpolitik">
>       <map:generate src="http://www.netzpolitik.org/feed/"/>
>       <map:serialize/>
>     </map:match>

That depends on the serializer used. I configured the xml-serializer in the sitemap for those feeds to encode to UTF-8: no parsing error.

> In the case of http://www.netzpolitik.org/feed/ you go in with iso-8859-1 and come out with utf-8 (if you didn't change the settings of your xml-serializer).
>
> You will also have to make sure that the character encoding of your output, <encoding>UTF-8</encoding>, agrees with the encoding information sent with e.g. mime-type="application/xhtml+xml; charset=utf-8"

Do you mean I should register the serializer used with both the parameter charset and the element encoding, corresponding (having the same value)?

> by your serializer in the HTTP response header. The following is an example xhtml serializer config containing both pieces of information:
>
>     <map:serializer name="xhtml"
>         mime-type="application/xhtml+xml; charset=utf-8"
>         logger="sitemap.serializer.xhtml"
>         pool-grow="2" pool-max="64" pool-min="2"
>         src="org.apache.cocoon.components.serializers.XHTMLSerializer">
>       <encoding>UTF-8</encoding>
>       <indent>no</indent>
>     </map:serializer>
>
> What generator have you been using for your work? Maybe I didn't fully understand your problem ...

I used both xml and html (to see if there is any difference in the output, but there is none). In the Userdocs it says that you shouldn't use the charset parameter but rather have the encoding set properly; at least this applies to the xml and html serializers. I used the same setting that is in use for the newsfeeds in the sample portal shipped with cocoon.

> [1] http://livehttpheaders.mozdev.org/ for Firefox/Mozilla users

Thanks, that was a useful hint. It reminded me of the WebDeveloper extension I have installed ;)

Best regards,
christian
Re: XML-Serializer encoding
Hello Marc, Marc Portier schrieb: never change your container-encoding unless you have a servlet container of which you can specify the used encoding applied in decoding of url's and request parameters (if you don't understand what I just said: that translates to simply never) I think I got it :) It also said that in the comments of the web.xml - file, as to never change it unless the servlet-container is buggy (which I suppose Tomcat 5.0.28 is not), but I thought I might give it a shot. But since that didn't help I changed it back to the original setting like where? I just did a rough scan but couldn't find any 'multiple byte for single character' occurances OK, so I belive I got something wrong. These characters that I thought to be Unicode-Characters are rather XML-Interpretations? There are often Chars like #8221; in the feeds. Since these aren't translated properly and they are not part of Latin-1 I thought they must be UTF-8, which they obviously aren't, or are they? $ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | grep '#' are all punctuation chars that seem to be correctly applied see above :) you're more than probably right I have never used coplets, nor even looked at them (deeply sorry) but I would certainly check the way these feeds are interpreted in the first place (rather then how they are serialized) if that is bad, then nothing furtheron in the pipe will be able to produce decent characterstreams regardless of encoding scheme's you're trying out on the serializer This is the relevant part of my sitemap: map:match pattern=live.rss map:generate type=file src={request-param:feed} label=content / map:transform type=xslt src=styles/rss2html.xsl map:parameter name=fullscreen value={coplet:aspectDatas/fullScreen}/ /map:transform map:serialize type=xml/ /map:match So my next thought was that it is the XSL that is messing up the RSS. 
So I edited the XSL and added this line after the xsl:stylesheet element:

    <xsl:output method="html" encoding="ISO-8859-1"/>

but it didn't help either. Maybe someone would like to take a look at the XSL I attached to see whether there is something wrong with it?

> on the side: you don't need to set your serializer specific encoding if you have set the form-encoding init param in the web.xml to utf-8 (which I would suggest at all times)

Done. And thanks a lot for your effort, everybody. I really appreciate that :)

Best regards,
christian

    <?xml version="1.0"?>
    <!--
      Copyright 1999-2004 The Apache Software Foundation

      Licensed under the Apache License, Version 2.0 (the "License");
      you may not use this file except in compliance with the License.
      You may obtain a copy of the License at

          http://www.apache.org/licenses/LICENSE-2.0

      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License.
    -->
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="html" encoding="ISO-8859-1"/>

    <!-- $Id: rss2html.xsl 30932 2004-07-29 17:35:38Z vgritsenko $ -->

    <xsl:param name="fullscreen"/>

    <xsl:template match="rss">
      <xsl:apply-templates select="channel"/>
    </xsl:template>

    <xsl:template match="channel">
      <xsl:if test="title">
        <b><a href="{link}"><xsl:value-of select="title"/></a></b>
        <br/>
      </xsl:if>
      <xsl:if test="description">
        <font size="-3">&#160;(<xsl:value-of select="description"/>)</font>
      </xsl:if>
      <table>
        <xsl:apply-templates select="item"/>
      </table>
    </xsl:template>

    <xsl:template match="item">
      <!-- Display the first 5 entries -->
      <xsl:if test="$fullscreen='true' or position() &lt; 6">
        <tr>
          <td>
            <a target="_blank" href="{link}">
              <font size="-1">
                <b><xsl:value-of select="title"/></b>
              </font>
            </a>
            <xsl:apply-templates select="description"/>
          </td>
        </tr>
        <tr><td height="5">&#160;</td></tr>
      </xsl:if>
    </xsl:template>

    <xsl:template match="description">
      <font size="-2">
        <br/>
        &#160;&#160;<xsl:apply-templates/>
      </font>
    </xsl:template>

    <xsl:template match="node()|@*" priority="-1">
      <xsl:copy>
        <xsl:apply-templates select="@*"/>
        <xsl:apply-templates/>
      </xsl:copy>
    </xsl:template>

    </xsl:stylesheet>
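The question raised in this message, whether a reference such as &#8221; means the feed is "really" UTF-8, can be checked outside Cocoon. The following is a minimal sketch in Python rather than anything Cocoon-specific; the one-element feed fragment is hypothetical and not taken from the real feeds. It suggests that a numeric character reference is plain ASCII inside an ISO-8859-1 document, that the parser resolves it to U+201D, and that the byte encoding is only chosen again at serialization time.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragment (an assumption, not one of the real feeds).
# It is declared and encoded as ISO-8859-1 and carries U+201D only as the
# ASCII-only numeric character reference &#8221;.
FEED = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<title>Caf\u00e9 &#8221;news&#8221;</title>"
).encode("iso-8859-1")

title = ET.fromstring(FEED)

# After parsing, the reference has been resolved to the actual character.
parsed_text = title.text

# Serialization chooses the byte encoding; the characters stay the same.
as_utf8 = ET.tostring(title, encoding="utf-8")
as_latin1 = ET.tostring(title, encoding="iso-8859-1")

print(repr(parsed_text))
print(as_utf8)
print(as_latin1)
```

In this sketch ElementTree writes characters that the target encoding cannot represent as numeric references again, so the ISO-8859-1 output still contains &#8221; while the UTF-8 output contains the raw bytes. How Cocoon's own serializers handle this case is not shown here.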
XML-Serializer encoding
Hi,

I have several newsfeeds that I want to incorporate in my portal; each one of these feeds has its own coplet. But these feeds are encoded differently: some are in ISO-8859-1, others in UTF-8, and there is no way that I can change the encoding they are delivered in. Unfortunately it seems that, no matter which encoding I set on the xml serializers (in the corresponding pipeline) that I use for those feeds, the UTF-8 feeds are not displayed properly. Is there a way to convert the feeds in Cocoon, so that feeds arriving in encoding A are changed to encoding B? I wouldn't mind having them all in UTF-8...

Any help would be very much appreciated.

Best regards,
christian
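As a rough illustration of the symptom described in this message, here is a small Python sketch of what happens when UTF-8 bytes are read as ISO-8859-1, and of what converting from one encoding to another amounts to. The sample headline and the helper name `transcode` are assumptions made up for this example; nothing here is Cocoon API.

```python
def transcode(raw: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes with their real source encoding, then re-encode them."""
    return raw.decode(source).encode(target)


# Hypothetical headline containing non-ASCII characters.
headline = "Caf\u00e9 \u201cnews\u201d"
utf8_bytes = headline.encode("utf-8")

# Reading UTF-8 bytes as ISO-8859-1 never raises an error, because every
# byte value is a valid ISO-8859-1 character. The result is simply wrong.
misread = utf8_bytes.decode("iso-8859-1")

# Converting "encoding A to encoding B" only works when A is known.
latin1_bytes = "Caf\u00e9".encode("iso-8859-1")
converted = transcode(latin1_bytes, "iso-8859-1")

print(repr(misread))
print(converted)
```

For XML documents a byte-level conversion like this would also leave the encoding declaration in the prolog out of date, which is why parsing the feed and serializing it again in the target encoding is usually the safer route.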