Hi Michael,

I've fooled with this in several forms, always with the same results. My 
current incarnation of the code uses the LSSerializer API. I've also used the 
(deprecated) XMLSerializer. In either case, I've tried StringWriter, 
FileWriter, and ByteArrayOutputStream (then to a FileOutputStream to write to a 
file). I specify UTF-8 as the output encoding. Here's a snippet of the code:

                
    System.setProperty(DOMImplementationRegistry.PROPERTY,
            "org.apache.xerces.dom.DOMImplementationSourceImpl");
    DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
    DOMImplementation domImpl = registry.getDOMImplementation("LS 3.0");
    DOMImplementationLS implLS = (DOMImplementationLS) domImpl;
    LSSerializer dom3Writer = implLS.createLSSerializer();
    LSOutput output = implLS.createLSOutput();
    ByteArrayOutputStream bs = new ByteArrayOutputStream();
    output.setByteStream(bs);
    output.setEncoding("UTF-8");
    dom3Writer.write(doc, output);

Here's what gets written to a file from that byte stream:

<test><div>¦º3 times: ÷ ÷ ÷º¬</div><divCDATA><![CDATA[¦º3 times: 
]]>&#xf7;<![CDATA[ ]]>&#xf7;<![CDATA[ 
]]>&#xf7;<![CDATA[º¬]]></divCDATA></test>

Note that the serialized element that is *not* a cdata section converts the 
division symbol to UTF-8 without a problem.

Steve

-----Original Message-----
From: Michael Glavassevich [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 07, 2007 11:04 PM
To: j-users@xerces.apache.org
Cc: Steve Carton
Subject: Re: Split CDATA Sections and the division Symbol (x00f7)

Hi Steve,

"Steve Carton" <[EMAIL PROTECTED]> wrote on 11/06/2007
04:10:45 PM:

> I'm trying to figure out if this is a bug or not. I created a DOM with 
> an element with a CDATA section and I set the value to a String of 
> characters which include a division symbol (xF7). (I actually do this 
> by reading the characters in from a file and converting them from 
> bytes to a String specifying a Windows-1252 encoding.) When I 
> serialize this DOM out to a String, byte array or anything else, the 
> CData section is split around the division symbol and the division 
> symbol is emitted as an entity (&#xF7;). I do try to serialize this as
> UTF-8.

Some questions ...

What API are you using for serialization? Are you specifying an output 
encoding? What type of output are you writing to? A java.io.OutputStream? A 
java.io.Writer?

> I see in the documentation that this is the correct behavior when the 
> serializer encounters a Unicode character that isn't recognized; not 
> sure if this means not recognized in the Unicode (internal) form or 
> there is no UTF-8 equivalent. But x00F7 seems to be the correct 
> Unicode value for a division symbol and there is a UTF-8 encoding for 
> it.  Other "special" characters seem to serialize to UTF-8 without 
> this split.

I think what you meant to say here is "not expressible in the output encoding". 
For instance ASCII is only capable of representing Unicode code points from 
0x00-0x7F. 0xF7 isn't representable in ASCII.
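
To make the "expressible in the output encoding" point concrete, here is a minimal sketch using only the standard java.nio.charset API. It checks whether U+00F7 can be encoded in US-ASCII and in UTF-8, and prints its UTF-8 bytes. The class and method names (DivisionSignEncoding, canEncode, utf8Hex) are made up for illustration and are not part of Xerces; the sketch says nothing about how the serializer itself makes this decision.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative helper (hypothetical names, not part of Xerces).
public class DivisionSignEncoding {
    // U+00F7 DIVISION SIGN, the character from the report.
    static final char DIVISION_SIGN = '\u00F7';

    // True if the given charset has an encoding for the character.
    static boolean canEncode(Charset cs, char c) {
        return cs.newEncoder().canEncode(c);
    }

    // Upper-case hex of the character's UTF-8 byte sequence.
    static String utf8Hex(char c) {
        StringBuilder sb = new StringBuilder();
        for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("US-ASCII can encode: "
                + canEncode(StandardCharsets.US_ASCII, DIVISION_SIGN));
        System.out.println("UTF-8 can encode:    "
                + canEncode(StandardCharsets.UTF_8, DIVISION_SIGN));
        System.out.println("UTF-8 bytes:         " + utf8Hex(DIVISION_SIGN));
    }
}
```

If the serializer really is writing UTF-8, the character is representable, so a character reference should not be needed on encoding grounds alone.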

> I can send code. I've tried this on the latest Xerces-J. Anyone have 
> any thoughts about it?
>
> Thanks,
>
> Steve Carton

Thanks.

Michael Glavassevich
XML Parser Development
IBM Toronto Lab
E-mail: [EMAIL PROTECTED]


