Re: XML-Serializer encoding

2006-01-17 Thread Edwin Kapauni

christian bindeballe wrote:
[...]
That's right. My mistake. I merely deduced the encoding from some 
characters used inside the text of the feeds, for example &#8221;, which 
are clearly non-Latin-1 characters. Since both feeds have ISO-8859-1 in 
their response headers, it means that these feeds are either malformed 
or wrongly encoded.


Don't mix up character set and character encoding.
8221 is the decimal notation of the Unicode character U+201D.
iso-8859-1 and utf-8 are just character encodings.

Encoding and formatting of both your sources are ok, but the selection of 
iso-8859-1 is a poor choice regarding readability.
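
The difference between a code point and its encodings can be checked in a few lines. The following is only a sketch in Python, which is not used anywhere in this thread; the variable names are made up for illustration.

```python
# 8221 is simply the decimal form of the code point U+201D.
code_point = 8221
quote = chr(code_point)

# The same number written in the usual hexadecimal U+ notation.
hex_form = f"U+{code_point:04X}"

# One possible byte representation of that code point.
utf8_bytes = quote.encode("utf-8")

# ISO-8859-1 has no byte for this code point, so a strict encode fails.
try:
    quote.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(hex_form, utf8_bytes, latin1_ok)
```

The code point stays the same throughout; only the byte representation depends on the chosen encoding.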


[...]
Do you mean I should register the serializer used with both the 
parameter charset and the element encoding corresponding (having the 
same value)?


The first one goes into HTTP response header and is needed by any 
browser to recognize the character encoding of the following content. 
You may run your own test by just omitting it and checking HTTP response 
header of your output.
The second one is telling the serializer which character encoding to use 
for the output.
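
Why the two settings have to agree can be shown without Cocoon at all. The sketch below (Python, chosen only for illustration; all names are hypothetical) encodes a string the way a serializer configured for UTF-8 would, then decodes it once with the matching charset and once with a mismatching one.

```python
# Text as it exists inside the pipeline (decoded, no byte encoding attached).
text = "Gr\u00fc\u00dfe \u201d"  # "Grüße" followed by a right double quotation mark

# What a serializer does when it is told to write UTF-8.
body = text.encode("utf-8")

# A client that is told "charset=utf-8" in the header decodes correctly ...
as_declared_utf8 = body.decode("utf-8")

# ... while a client that assumes ISO-8859-1 sees the typical garbage.
as_assumed_latin1 = body.decode("iso-8859-1")

print(as_declared_utf8 == text)   # True
print(as_assumed_latin1 == text)  # False
```

The bytes on the wire are identical in both cases; only the announced charset decides how they are read.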


[...]
I used both xml and html (to see if there is any difference in the 
output, but there is none). In the Userdocs it says that you shouldn't 


And after serializing, do you still have &#8221; in the output, or has 
it been converted to the right double quotation mark (U+201D)?



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: XML-Serializer encoding

2006-01-17 Thread christian bindeballe

Edwin Kapauni schrieb:

Don't mix up character set and character encoding.
8221 is the decimal notation of the Unicode character U+201D.
iso-8859-1 and utf-8 are just character encodings.

Encoding and formatting of both your sources are ok, but the selection of 
iso-8859-1 is a poor choice regarding readability.


That choice is not mine to make, unfortunately, regarding the feeds I 
receive, I mean. If you suggest I should use UTF-8 as encoding scheme, I 
can understand why and I will certainly try to do that.


The first one goes into HTTP response header and is needed by any 
browser to recognize the character encoding of the following content. 
You may run your own test by just omitting it and checking HTTP response 
header of your output.
The second one is telling the serializer which character encoding to use 
for the output.


OK, understood, that is very useful to know.

And after serializing, do you still have &#8221; in the output, or has 
it been converted to the right double quotation mark (U+201D)?


I figured last night that what was causing my funny output was not the 
character set nor the encoding, but the way my XSL handled the feeds.
Still this whole process of looking into the character encodings and so 
forth helped me a lot and also made me realize it was not the cause of 
my problems. Thanks for pointing out these specific details about 
Unicode and UTF-8, I appreciate it a lot and am very glad I learned 
something again :) Again, thank you for your time and effort.


Best regards,

christian




Re: XML-Serializer encoding

2006-01-17 Thread Marc Portier


christian bindeballe wrote:
 Hello Marc,
 
 Marc Portier schrieb:
 

snip /

 
 OK, so I believe I got something wrong. These characters that I thought
 to be Unicode-Characters are rather XML-Interpretations?

Regarding unicode and encodings, please read this:
http://www.joelonsoftware.com/articles/Unicode.html

My Shortlist:

- avoid using the word 'character' since it often leads to other
interpretations than the one you intend.
- use 'glyph' or 'symbol' instead to indicate the typographic idiom
people know and write down
- understand that the main job of the unicode standard is to assign
so-called code-points to just about all glyphs that exist out there. These
code-points are interchanged between humans in a textual format that
starts with U+ and is followed by four to six hexadecimal digits
- these code-points are interchanged between computers in
byte-sequences; how to map code-points to byte-sequences is regulated by
the encoding
- there is more than one encoding to choose from: the most commonly known
are iso-8859-1 (latin-1), cp1252, utf-16, utf-8. In other words the same
code-point/glyph can be interchanged in totally different byte-sequences
- latin-1 is a single-byte encoding and doesn't have room for all glyphs
in the unicode list: it only covers the code-points U+0000 to U+00FF.
Code-points outside that range cannot be encoded at all, so an encoder
has to fail, substitute something like '?', or (in XML) write a
character reference instead
- utf-8 is a variable-width encoding where, depending on the code-point,
the encoding results in a byte sequence of one to four bytes (the
original design allowed up to six, the current definition stops at four)
- since an exchanged text-file on disk (cd/usb) or over the net is just a
bunch of bytes, it is in fact (theoretically) unreadable if you don't
know the applied encoding
- xml files allow you to specify the encoding of the file itself in the
xml declaration (first line of the file, and thus already in a certain
encoding :) there is indeed a chicken and egg problem there, and a
possible mismatch leading to parser failures if the file-encoding doesn't
match the declared one
- xml files also allow the use of so-called character-entities (numeric
character references) to communicate glyphs. Typically they are only used
to communicate those glyphs that don't have a valid byte-sequence in the
current encoding.
These entities follow either one of these patterns:
  &#(codepoint-in-decimal);
  &#x(codepoint-in-hexadecimal);
- These entities are resolved (just like &gt; &lt; &apos; &quot; and
&amp;) by your parser, in other words: in regular XML APIs like SAX or DOM
you will no longer find any reference to them, they got replaced by their
actual glyph-representation in the programming language of your choice
(which in Java actually is utf-16)
- These entities are automatically and smartly inserted by the xalan
serializers depending on the encoding you force them to
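
The points of this list about byte sequences and character references can be tried out directly. What follows is a hedged sketch in Python (not part of the original discussion; the names are illustrative only) that encodes one code point in several encodings and lets the encoder fall back to a numeric character reference where the target encoding has no byte for it.

```python
quote = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

# The same code point as byte sequences in three different encodings.
encoded = {name: quote.encode(name) for name in ("utf-8", "utf-16-be", "cp1252")}

# ISO-8859-1 cannot represent it; "xmlcharrefreplace" writes a numeric
# character reference instead, similar to what an XML serializer does.
as_reference = quote.encode("iso-8859-1", errors="xmlcharrefreplace")

# UTF-8 is variable width: one to four bytes per code point.
widths = [len(ch.encode("utf-8")) for ch in ("A", "\u00df", "\u201d", "\U0001F600")]

print(encoded, as_reference, widths)
```

Note that cp1252 does have a single byte for this quotation mark while iso-8859-1 does not, which is one common source of confusion between the two.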

 There are often Chars like &#8221; in the feeds. Since these aren't
 translated properly and they are not part of Latin-1 I thought they must
 be UTF-8, which they obviously aren't, or are they?
 

no. utf-8 is nowhere in sight here

these sequences are on file-level genuine valid iso-8859-1
byte-sequences that make up a glyph-sequence &#8221;

which only on XML level is recognised as a 'character entity' and thus
interpreted as to be replaced by one single glyph
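
This can be reproduced with any XML parser. The sketch below uses Python's standard ElementTree (an assumption for illustration, Cocoon itself uses a Java parser) on a made-up one-element document that is declared and encoded as iso-8859-1 and contains the reference as seven plain ASCII bytes.

```python
import xml.etree.ElementTree as ET

# A tiny made-up document, stored as ISO-8859-1 bytes. The umlaut and the
# sharp s are single Latin-1 bytes; the quotation mark is a character reference.
raw = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    "<title>Gr\u00fc\u00dfe &#8221;</title>"
).encode("iso-8859-1")

# After parsing, the reference is gone: the text holds one single character.
title = ET.fromstring(raw).text

print(len(title), hex(ord(title[-1])))
```

On file level the reference is ordinary Latin-1 text; only the parser turns it into the code point U+201D.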


so the question remains: what do you mean by 'not translated correctly'?

Note that a final element in this whole discussion is the font you are
using: sometimes simple system-fonts don't have a valid
glyph-representation available for a perfectly legal, correctly
communicated codepoint... so you try solving things completely at the
wrong end :-(


 
 $ wget -q -O -
 http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
 | grep '#'



 are all punctuation chars that seem to be correctly applied
 
 
 see above :) you're more than probably right
 

thx for your confidence :-)
http://www.unicode.org/charts/PDF/U2000.pdf

 I have never used coplets, nor even looked at them (deeply sorry)
 but I would certainly check the way these feeds are interpreted in the
 first place (rather than how they are serialized)

 if that is bad, then nothing further on in the pipe will be able to
 produce decent character streams regardless of the encoding schemes
 you're trying out on the serializer
 
 
 This is the relevant part of my sitemap:
 <map:match pattern="live.rss">

so this url will actually look like:

http://yourserver/cocoon/submap/live.rss?feed=http://whatever.de/some.rss

right?

 <map:generate type="file" src="{request-param:feed}"
 label="content" />

this will read the mentioned feed and parse it; since the feeds are ok
regarding encoding and character entities, I suspect all things would be ok

 <map:transform type="xslt" src="styles/rss2html.xsl">
 <map:parameter name="fullscreen"
 value="{coplet:aspectDatas/fullScreen}"/>
 </map:transform>
 <map:serialize type="xml"/>

odd, your stylesheet claims in its name to be targeting html, yet you
serialize as xml, just for debugging maybe?

 </map:match>
 
 So my next 

Re: XML-Serializer encoding

2006-01-17 Thread Edwin Kapauni

christian bindeballe wrote:
[...]
 I figured last night that what was causing my funny output was not
 the character set nor the encoding, but the way my XSL handled the
 feeds.

That's why I recommended omitting the transformation step and watching 
what happens.


  <map:match pattern="netzpolitik">
    <map:generate src="http://www.netzpolitik.org/feed/"/>
    <map:serialize/>
  </map:match>

[...]
my problems. Thanks for pointing out these specific details about 
Unicode and UTF-8, I appreciate it a lot and am very glad I learned 
something again :) Again, thank you for your time and effort.


I suppose you speak German? Then maybe you'd like to subscribe to

de.comm.infosystems.www.authoring.misc

where all aspects of (X)HTML are discussed.

Also http://de.wikipedia.org/wiki/UTF-8
and  http://de.wikipedia.org/wiki/Unicode
are good references in German.





Re: XML-Serializer encoding

2006-01-17 Thread christian bindeballe

Edwin Kapauni schrieb:



Also http://de.wikipedia.org/wiki/UTF-8
and  http://de.wikipedia.org/wiki/Unicode
are good references in German.


cheers, I looked here:

http://en.wikipedia.org/wiki/Character_encoding ;)

it mentions the difficulty of distinguishing between character sets and 
character encodings.


cb




Re: XML-Serializer encoding

2006-01-17 Thread Peter Flynn
On Tue, 2006-01-17 at 12:40, christian bindeballe wrote:
 Edwin Kapauni schrieb:
 
  
  Also http://de.wikipedia.org/wiki/UTF-8
  and  http://de.wikipedia.org/wiki/Unicode
  are good references in German.
 
 cheers, I looked here:
 
 http://en.wikipedia.org/wiki/Character_encoding ;)
 
 it mentions the difficulty of distinguishing between character sets and 
 character encodings.

See also the XML FAQ at http://xml.silmaril.ie/authors/characters/

///Peter






Re: XML-Serializer encoding

2006-01-17 Thread christian bindeballe

Edwin Kapauni schrieb:
The first one goes into HTTP response header and is needed by any 
browser to recognize the character encoding of the following content. 
You may run your own test by just omitting it and checking HTTP response 
header of your output.
The second one is telling the serializer which character encoding to use 
for the output.


Hello again!

As I run cocoon inside a Tomcat 5.0.28 installed locally on my machine, 
I am afraid, I feel I am missing something here. Obviously the container 
encoding of my tomcat needs to be ISO-8859-1 and that cannot be changed.


[quote] Since the servlet specification requires that the ISO-8859-1 
encoding is used (by default), you should never change this value unless 
you have a buggy servlet container.[/quote]


So I cannot change the way Tomcat encodes characters, do I get this 
right? Also, but this may be caused by a local installation, the 
response headers don't list any encoding :( nor any charset. This is 
what I get:


Response Headers - http://localhost:8080/copo/portal/portal

X-Cocoon-Version: 2.1.8
Set-Cookie: JSESSIONID=723AA70AA8294D5DC83E78C1BD490B3C; Path=/copo
Cache-Control: no-cache, no-store
Pragma: no-cache
Expires: Thu, 01 Jan 2000 00:00:00 GMT
Content-Type: text/html
Content-Length: 6238
Date: Tue, 17 Jan 2006 14:21:44 GMT
Server: Apache-Coyote/1.1

200 OK

I tried all available sitemaps along the line and added the parameter 
charset=UTF-8 to the HTML-Serializer of the Base-Sitemap, and also to the 
HTML-Include-Serializer, to no avail. I don't think this part of the 
response header gets lost on the way; I believe it is never set. I 
searched for any hints as to where I could change that, but apart from 
some API-Docs didn't find anything (useful at all). So, if anybody has an 
idea how I could make this happen, I'd be more than grateful for a hint 
or a solution.


Regards, christian




Re: XML-Serializer encoding

2006-01-17 Thread Edwin Kapauni

christian bindeballe wrote:
[...]
[quote] Since the servlet specification requires that the ISO-8859-1 
encoding is used (by default), you should never change this value unless 
you have a buggy servlet container.[/quote]


Citation without sources? Where did you get that nonsense from?






Re: XML-Serializer encoding

2006-01-17 Thread christian bindeballe

Edwin Kapauni schrieb:

christian bindeballe wrote:
[...]

[quote] Since the servlet specification requires that the ISO-8859-1 
encoding is used (by default), you should never change this value 
unless you have a buggy servlet container.[/quote]



Citation without sources? Where did you get that nonsense from?


I thought I didn't need to mention it, since I think I posted once 
already, and it is there in the web.xml in my WEB-INF directory of the 
cocoon build I use.


also, Marc Portier wrote: (see this thread, message-ID 
[EMAIL PROTECTED])


never change your container-encoding unless you have a servlet container
for which you can specify the encoding applied when decoding URLs
and request parameters

(if you don't understand what I just said: that translates to simply
never)

confused

christian




Re: XML-Serializer encoding

2006-01-17 Thread Edwin Kapauni

christian bindeballe wrote:
[...]
also, Marc Portier wrote: (see this thread, message-ID 
[EMAIL PROTECTED])


never change your container-encoding unless you have a servlet container
for which you can specify the encoding applied when decoding URLs
and request parameters

[...]

Hi Christian,
what I've been writing about is not container-encoding. I was writing 
about character encoding of the documents and about HTTP response header.


And to add even more confusion, there are also form encoding, 
URL encoding, document encoding, transmission encoding, ...


As you are going to supply something for web browsers through HTTP, the 
browsers will need something like


Content-Type: text/html;charset=utf-8 

in the HTTP response header. And this is given by the *second* line in 
the following serializer configuration:


  <map:serializer name="xhtml"
    mime-type="test/html; charset=utf-8"
    logger="sitemap.serializer.xhtml"
    pool-grow="2" pool-max="64" pool-min="2"
    src="org.apache.cocoon.components.serializers.XHTMLSerializer">
    <encoding>UTF-8</encoding>
    <indent>no</indent>
  </map:serializer>

Best way to test is from a very minimalistic sample application with 
just this serializer configuration and a short pipeline with only


<map:sitemap>
  <map:components>
    <map:serializers default="xml">
      <map:serializer name="xhtml"
        mime-type="test/html; charset=utf-8"
        logger="sitemap.serializer.xhtml"
        pool-grow="2" pool-max="64" pool-min="2"
        src="org.apache.cocoon.components.serializers.XHTMLSerializer">
        <encoding>UTF-8</encoding>
        <indent>no</indent>
      </map:serializer>
    </map:serializers>
  </map:components>
  <map:pipelines>
    <map:pipeline>
      <map:match pattern="netzpolitik">
        <map:generate src="http://www.netzpolitik.org/feed/"/>
        <map:serialize/>
      </map:match>
    </map:pipeline>
  </map:pipelines>
</map:sitemap>

Try this sample and play with mime-type and encoding and watch your 
output.
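
When watching the output, the interesting part of the Content-Type header is the charset parameter. As a small aid, the following Python sketch (purely illustrative; `charset_of` is a made-up helper, not a Cocoon or browser API) extracts that parameter from a header value using the standard library.

```python
from email.message import Message


def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Header as recommended in this message.
print(charset_of("text/html;charset=utf-8"))
# Header without any charset, as in the response dump posted earlier.
print(charset_of("text/html"))
```

A result of None corresponds to the case where the client has to guess the encoding on its own.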







Re: XML-Serializer encoding

2006-01-17 Thread christian bindeballe

Edwin Kapauni schrieb:

  <map:serializer name="xhtml"
    mime-type="test/html; charset=utf-8"
    logger="sitemap.serializer.xhtml"
    pool-grow="2" pool-max="64" pool-min="2"
    src="org.apache.cocoon.components.serializers.XHTMLSerializer">
    <encoding>UTF-8</encoding>
    <indent>no</indent>
  </map:serializer>

Best way to test is from a very minimalistic sample application with 
just this serializer configuration and a short pipeline with only


<map:sitemap>
  <map:components>
    <map:serializers default="xml">
      <map:serializer name="xhtml"
        mime-type="test/html; charset=utf-8"
        logger="sitemap.serializer.xhtml"
        pool-grow="2" pool-max="64" pool-min="2"
        src="org.apache.cocoon.components.serializers.XHTMLSerializer">
        <encoding>UTF-8</encoding>
        <indent>no</indent>
      </map:serializer>
    </map:serializers>
  </map:components>
  <map:pipelines>
    <map:pipeline>
      <map:match pattern="netzpolitik">
        <map:generate src="http://www.netzpolitik.org/feed/"/>
        <map:serialize/>
      </map:match>
    </map:pipeline>
  </map:pipelines>
</map:sitemap>

Try this sample and play with mime-type and encoding and watch your 
output.



ok, the output is fine, when I saw the two little tricks you put in the 
code snippet I figured how it was supposed to work. so used in this 
snippet the charset-thingy works fine in the response headers. now I 
only need to find the final serializer used for the portal and add the 
charset-setting there.


thanks a lot, edwin :)

christian




Re: XML-Serializer encoding

2006-01-17 Thread christian bindeballe
As I thought, it was the html-include serializer in 
{base}/portal/sitemap.xmap that needed some fitting :)


cb




RE: XML-Serializer encoding

2006-01-16 Thread Ard Schrijvers
Think you should have no problem at all when you just serialize everything as 
utf-8:

<map:serializer logger="sitemap.serializer.xml" mime-type="text/xml" name="xml" 
pool-grow="4" pool-max="32" pool-min="4" 
src="org.apache.cocoon.serialization.XMLSerializer">
<encoding>UTF-8</encoding>
</map:serializer>

AS
  
 
 
 Hi,
 
 I have several newsfeeds that I want to incorporate in my portal, each
 one of these feeds has its own coplet. but these feeds are encoded
 differently. some are in ISO-8859-1, others in UTF-8. Now there is no
 way that I can change the legacy encoding of these. unfortunately it
 seems that even though I set the encoding of the xml-serializers (in the
 corresponding pipeline) that I use for those feeds to whatever, the
 UTF-8-feeds are not displayed properly. is there a way that I can change
 the encoding in cocoon so the feeds that arrive in encoding a can be
 changed to encoding b? I wouldn't mind having them all in UTF-8...
 
 any help would be very much appreciated.
 
 best regards,
 
 christian
 
 
 




Re: XML-Serializer encoding

2006-01-16 Thread christian b
Thank you, Ard

I already did that. But it doesn't change anything. I found this in my
web.xml in the WEB-INF folder of my cocoon-build:

<!--
  Set encoding used by the container. If not set the ISO-8859-1 encoding
  will be assumed.
  Since the servlet specification requires that the ISO-8859-1 encoding
  is used (by default), you should never change this value unless
  you have a buggy servlet container.
-->
<init-param>
  <param-name>container-encoding</param-name>
  <param-value>ISO-8859-1</param-value>
</init-param>

Servlet-Container used is Tomcat 5.0.28

I switched the encoding parameter to UTF-8 to check whether it would
work, and it seems to. But still the coplets aren't encoded properly.

Then I saw that the whole page is encoded in ISO-8859-1, having been
serialized in HTML (as seen in the doctype of the page). So I looked
for the HTML-Serializer in my portal/sitemap.xmap and changed the
encoding of the html-serializer, too. no difference

these are the feed addresses that I want to incorporate. neither of them
has an encoding set (do RSS-feeds have to have that?) but they
clearly contain UTF-8 encoded characters.

http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
http://www.netzpolitik.org/feed/

so, I guess that somewhere along the line from generating to
serializing these feeds are messed with in a way that the encoding set
in the serializers has no effect whatsoever.

suggestions as to where this could be, anyone?

it would be greatly appreciated :)

regards, christian

2006/1/16, Ard Schrijvers [EMAIL PROTECTED]:
 Think you should have no problem at all when you just serialize everything as 
 utf-8:

 <map:serializer logger="sitemap.serializer.xml" mime-type="text/xml" 
 name="xml" pool-grow="4" pool-max="32" pool-min="4" 
 src="org.apache.cocoon.serialization.XMLSerializer">
 <encoding>UTF-8</encoding>
 </map:serializer>

 AS




Re: XML-Serializer encoding

2006-01-16 Thread Edwin Kapauni

christian b wrote:
[...]

these are the feed addresses that I want to incorporate. neither of them
has an encoding set (do RSS-feeds have to have that?) but they
clearly contain UTF-8 encoded characters.

http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
http://www.netzpolitik.org/feed/


Hi Christian,
Have a look at the HTTP response headers[1] of those feeds. The 
netzpolitik feed's header clearly states it's iso-8859-1.


Recoding will be automagically done on generating xml from that source. 
When you try with the following snippet in your pipeline, your output 
will have a parsing error[2] but its source code will be strictly 
according to encoding settings of your serializer


  <map:match pattern="netzpolitik">
    <map:generate src="http://www.netzpolitik.org/feed/"/>
    <map:serialize/>
  </map:match>

In case of http://www.netzpolitik.org/feed/ you go in with iso-8859-1 
and come out with utf-8 (if you didn't change the settings of  your 
xml-serializer).
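
The same "iso-8859-1 in, utf-8 out" recoding can be observed outside Cocoon. The sketch below is only an analogy in Python with the standard ElementTree module; the one-element document is made up and stands in for the real feed.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for a feed that is declared and encoded as ISO-8859-1.
source = (
    '<?xml version="1.0" encoding="iso-8859-1"?>'
    "<item>Gr\u00fc\u00dfe</item>"
).encode("iso-8859-1")

# Parsing decodes the bytes according to the declared encoding ...
root = ET.fromstring(source)

# ... and serializing writes them again in whatever encoding is requested.
output = ET.tostring(root, encoding="utf-8")

print(source)
print(output)
```

The parser takes care of the input encoding and the serializer of the output encoding; nothing in between has to know about either.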


You will also have to make sure that the character encoding of your output

 <encoding>UTF-8</encoding>

is in accordance with the encoding information sent with e.g.

 mime-type="application/xhtml+xml; charset=utf-8"

by your serializer in the HTTP response header. The following is an example 
xhtml serializer config containing both pieces of information.


  <map:serializer name="xhtml"
    mime-type="application/xhtml+xml; charset=utf-8"
    logger="sitemap.serializer.xhtml"
    pool-grow="2" pool-max="64" pool-min="2"
    src="org.apache.cocoon.components.serializers.XHTMLSerializer">
    <encoding>UTF-8</encoding>
    <indent>no</indent>
  </map:serializer>

Which generator have you been using for your work? Maybe I didn't fully 
understand your problem ...


[1]http://livehttpheaders.mozdev.org/ for Firefox/Mozilla users

[2]  XML Parsing Error: not well-formed
 Location: http://bodo:8080/netzpolitik
 Line Number 27, Column 17:





Re: XML-Serializer encoding

2006-01-16 Thread Marc Portier


christian b wrote:
 Thank you, Ard
 
 I already did that. But it doesn't change anything. I found this in my
 web.xml in the WEB-INF folder of my cocoon-build:
 
 <!--
   Set encoding used by the container. If not set the ISO-8859-1 encoding
   will be assumed.
   Since the servlet specification requires that the ISO-8859-1 encoding
   is used (by default), you should never change this value unless
   you have a buggy servlet container.
 -->
 <init-param>
   <param-name>container-encoding</param-name>
   <param-value>ISO-8859-1</param-value>
 </init-param>
 
 Servlet-Container used is Tomcat 5.0.28
 
 I switched the encoding parameter to UTF-8 to check whether it would
 work, and it seems to. But still the coplets aren't encoded properly.
 

never change your container-encoding unless you have a servlet container
for which you can specify the encoding applied when decoding URLs
and request parameters

(if you don't understand what I just said: that translates to simply
never)


e.g. when you use jetty (the only one I know) you can specify a system
property -Dorg.mortbay.util.URI.charset=utf-8

only then the cocoon servlet init param should be changed to match that

 Then I saw that the whole page is encoded in ISO-8859-1, having been
 serialized in HTML (as seen in the doctype of the page). So I looked
 for the HTML-Serializer in my portal/sitemap.xmap and changed the
 encoding of the html-serializer, too. no difference
 
 these are the feed addresses that I want to incorporate. neither of them
 has an encoding set (do RSS-feeds have to have that?) but they
 clearly contain UTF-8 encoded characters.
 

like where? I just did a rough scan but couldn't find any 'multiple byte
for single character' occurrences

note that many 'at first glance odd' characters DO have a valid position
in the ISO-8859-1 encoding

e.g. U+00DF, the typical german LATIN SMALL LETTER SHARP S = Eszett is
just encoded as the single byte hex DF in latin 1

the fact that a certain character requires 2 bytes in UTF-8 does not
mean that this character _IS_ a UTF-8 encoded char; the same
character might very well have a valid and useful single byte latin 1
encoding.

(in other words: the 'encoding' is never a property of the glyph, but I
admit: yeah, some glyphs don't have representations in all encodings)
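
The Eszett example and the 'rough scan' for multi-byte sequences can be sketched as follows. This is an illustration in Python, not the check that was actually run here, and `looks_like_utf8` is a hypothetical helper name.

```python
sharp_s = "\u00df"  # LATIN SMALL LETTER SHARP S

latin1 = sharp_s.encode("iso-8859-1")  # a single byte, hex DF
utf8 = sharp_s.encode("utf-8")         # two bytes, hex C3 9F


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(latin1, looks_like_utf8(latin1))
print(utf8, looks_like_utf8(utf8))
```

Pure ASCII data passes this check as well, so a positive result only says the bytes are compatible with UTF-8, not that UTF-8 was intended.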


 http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/

the http header states that the file is iso-8859-1 encoded:
(see the content-type header)

 
 $ wget -S --spider 
 http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
 --15:33:26--  
 http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/
= `index.html'
 Resolving www.industrial-technology-and-witchcraft.de... 212.227.64.59
 Connecting to 
 www.industrial-technology-and-witchcraft.de|212.227.64.59|:80... connected.
 HTTP request sent, awaiting response...
   HTTP/1.0 200 OK
   Date: Mon, 16 Jan 2006 14:33:26 GMT
   Server: Apache/1.3.33 (Unix)
   Cache-Control: no-store, no-cache, must-revalidate, post-check=0, 
 pre-check=0
   Expires: Mon, 16 Jan 2006 13:40:42 GMT
   Pragma: no-cache
   X-Powered-By: PHP/4.4.1
   Set-Cookie: exp_last_visit=822058407; expires=Tue, 16 Jan 2007 14:33:27 
 GMT; path=/
   Set-Cookie: exp_last_activity=1137418407; expires=Tue, 16 Jan 2007 14:33:27 
 GMT; path=/
   Set-Cookie: 
 exp_tracker=a%3A1%3A%7Bi%3A0%3Bs%3A15%3A%22%2FITW%2Fitw-rss20%2F%22%3B%7D; 
 path=/
   Last-Modified: Mon, 16 Jan 2006 12:40:42 GMT
   Content-Type: text/xml; charset=iso-8859-1;
   X-Cache: MISS from proxy2
   X-Cache-Lookup: MISS from proxy2:8080
   Connection: keep-alive
 Length: unspecified [text/xml]
 200 OK

and going with that the feed's xml declaration is nicely claiming the same:

 $ wget -q -O - 
 http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | 
 head -1
 <?xml version="1.0" encoding="iso-8859-1"?>

at first glance it also looks like a valid claim, with special
characters nicely encoded as XML entities

the ones I found with:
 $ wget -q -O - 
 http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | 
 grep '#'

are all punctuation chars that seem to be correctly applied


 http://www.netzpolitik.org/feed/
 

this one also has ISO_8859_1 encoding according to http header and xml
declaration

so both seem ok

 so, I guess that somewhere along the line from generating to
 serializing these feeds are messed with in a way that the encoding set
 in the serializers has no effect whatsoever.
 
 suggestions as to where this could be, anyone?
 

I have never used coplets, nor even looked at them (deeply sorry)
but I would certainly check the way these feeds are interpreted in the
first place (rather than how they are serialized)

if that is bad, then nothing further on in the pipe will be able to
produce decent character streams regardless of the encoding schemes
you're trying out on the serializer



so, what do you do 

Re: XML-Serializer encoding

2006-01-16 Thread christian bindeballe

Edwin Kapauni schrieb:

  Hi Christian,
Have a look at the HTTP response headers[1] of those feeds. The 
netzpolitik feed's header clearly states it's iso-8859-1.


That's right. My mistake. I merely deduced the encoding from some 
characters used inside the text of the feeds, for example &#8221;, which 
are clearly non-Latin-1 characters. Since both feeds have ISO-8859-1 in 
their response headers, it means that these feeds are either malformed 
or wrongly encoded.




Recoding will be automagically done on generating xml from that source. 
When you try with the following snippet in your pipeline, your output 
will have a parsing error[2] but its source code will be strictly 
according to encoding settings of your serializer


  <map:match pattern="netzpolitik">
    <map:generate src="http://www.netzpolitik.org/feed/"/>
    <map:serialize/>
  </map:match>


That depends on the serializer used. I configured the xml-serializer in 
the sitemap for those feeds to encode as UTF-8: no parsing error.


In case of http://www.netzpolitik.org/feed/ you go in with iso-8859-1 
and come out with utf-8 (if you didn't change the settings of  your 
xml-serializer).


You will also have to make sure that character encoding of your output

 <encoding>UTF-8</encoding>

is in accordance with encoding information sent with e.g.

 mime-type="application/xhtml+xml; charset=utf-8"


Do you mean I should register the serializer used with both the 
parameter charset and the element encoding corresponding (having the 
same value)?





by your serializer in the HTTP response header. The following is an example 
xhtml serializer config containing both pieces of information.


  <map:serializer name="xhtml"
    mime-type="application/xhtml+xml; charset=utf-8"
    logger="sitemap.serializer.xhtml"
    pool-grow="2" pool-max="64" pool-min="2"
    src="org.apache.cocoon.components.serializers.XHTMLSerializer">
    <encoding>UTF-8</encoding>
    <indent>no</indent>
  </map:serializer>

Which generator have you been using? Maybe I didn't fully 
understand your problem ...


I used both xml and html (to see if there is any difference in the 
output, but there is none). The Userdocs say that you shouldn't 
use the charset parameter but rather set the encoding properly; 
at least this applies to the xml and html serializers. I used the same 
setting that is in use for the newsfeeds in the sample portal shipped 
with Cocoon.




[1]http://livehttpheaders.mozdev.org/ for Firefox/Mozilla users


Thanks, that was a useful hint. It reminded me of the 
WebDeveloper-Extension I have installed ;)



Best regards, christian




Re: XML-Serializer encoding

2006-01-16 Thread christian bindeballe

Hello Marc,

Marc Portier wrote:

never change your container-encoding unless you have a servlet container
that lets you specify the encoding used to decode URLs
and request parameters

(if you don't understand what I just said: that translates to simply
never)


I think I got it :) The comments in the web.xml file say the same: 
never change it unless the servlet container is buggy (which 
I suppose Tomcat 5.0.28 is not), but I thought I might give it a shot. 
Since that didn't help, I changed it back to the original setting.



like where? I just did a rough scan but couldn't find any 'multiple byte
for single character' occurrences


OK, so I believe I got something wrong. Are these characters that I thought 
to be Unicode characters rather XML character references?
There are often characters like &#8221; in the feeds. Since these aren't 
translated properly and they are not part of Latin-1, I thought they must 
be UTF-8, which they obviously aren't, or are they?
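
For reference, &#8221; is a numeric character reference: the decimal code point of a Unicode character (U+201D) written in plain ASCII, so it is legal in a document of any encoding. A small Python sketch using only the standard library shows the relationship; the variable names are illustrative.

```python
import html

ref = "&#8221;"
char = html.unescape(ref)      # resolve the numeric character reference
codepoint = ord(char)          # 8221 decimal, i.e. 0x201D

utf8 = char.encode("utf-8")    # three bytes in UTF-8

# ISO-8859-1 has no byte for U+201D, so output in that encoding has to
# fall back to the character reference again.
latin1 = char.encode("iso-8859-1", errors="xmlcharrefreplace")
```

Here latin1 ends up as the ASCII bytes of &#8221; again, which is why such references can show up in a correctly encoded ISO-8859-1 feed.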




$ wget -q -O - http://www.industrial-technology-and-witchcraft.de/index.php/ITW/itw-rss20/ | grep '&#'



are all punctuation chars that seem to be correctly applied


see above :) you're more than probably right


I have never used coplets, nor even looked at them (deeply sorry)
but I would certainly check the way these feeds are interpreted in the
first place (rather than how they are serialized)

if that is bad, then nothing further on in the pipe will be able to
produce decent character streams regardless of the encoding schemes you're
trying out on the serializer


This is the relevant part of my sitemap:

<map:match pattern="live.rss">
  <map:generate type="file" src="{request-param:feed}" label="content"/>

  <map:transform type="xslt" src="styles/rss2html.xsl">
    <map:parameter name="fullscreen" value="{coplet:aspectDatas/fullScreen}"/>
  </map:transform>
  <map:serialize type="xml"/>
</map:match>

So my next thought was that it is the XSL that is messing up the RSS.
I edited the XSL and added this line after the opening xsl:stylesheet tag:

<xsl:output method="html" encoding="ISO-8859-1"/>

but it didn't help either. Maybe someone would like to take a look at 
the XSL I attached to see whether there is something wrong with it?




on the side: you don't need to set your serializer specific encoding if
you have set the form-encoding init param in the web.xml to utf-8 (which
I would suggest at all times)


done.

and thanks a lot for your effort, everybody. I really appreciate that :)

best regards,

christian
<?xml version="1.0"?>
<!--
  Copyright 1999-2004 The Apache Software Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

  http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="html" encoding="ISO-8859-1"/>
<!-- $Id: rss2html.xsl 30932 2004-07-29 17:35:38Z vgritsenko $ 

-->
<xsl:param name="fullscreen"/>

<xsl:template match="rss">
  <xsl:apply-templates select="channel"/>
</xsl:template>


<xsl:template match="channel">

  <xsl:if test="title">
    <b><a href="{link}"><xsl:value-of select="title"/></a></b>
    <br/>
  </xsl:if>
  <xsl:if test="description">
    <font size="-3">&#160;(<xsl:value-of select="description"/>)</font>
  </xsl:if>
  <table>
    <xsl:apply-templates select="item"/>
  </table>
</xsl:template>

<xsl:template match="item">
  <!-- Display the first 5 entries -->
  <xsl:if test="$fullscreen='true' or position() &lt; 6">
    <tr>
      <td>
        <a target="_blank" href="{link}">
          <font size="-1">
            <b><xsl:value-of select="title"/></b>
          </font>
        </a>
        <xsl:apply-templates select="description"/>
      </td>
    </tr>
    <tr><td height="5">&#160;</td></tr>
  </xsl:if>
</xsl:template>

<xsl:template match="description">
  <font size="-2">
    <br/>
    &#160;&#160;<xsl:apply-templates/>
  </font>
</xsl:template>

<xsl:template match="node()|@*" priority="-1">
  <xsl:copy>
    <xsl:apply-templates select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>


XML-Serializer encoding

2006-01-15 Thread Christian

Hi,

I have several newsfeeds that I want to incorporate in my portal; each 
of these feeds has its own coplet. But the feeds are encoded 
differently: some are in ISO-8859-1, others in UTF-8, and I cannot 
change the encoding they are delivered in. Unfortunately, no matter 
which encoding I set on the xml-serializers that I use for those feeds 
(in the corresponding pipeline), the UTF-8 feeds are not displayed 
properly. Is there a way to make Cocoon convert feeds that arrive in 
encoding A to encoding B? I wouldn't mind having them all in UTF-8...


any help would be very much appreciated.

best regards,

christian
