Re: [MarkLogic Dev General] CDATA sections being removed from XML documents by Marklogic

Michael Blakeley Thu, 15 Jul 2010 11:13:49 -0700

I don't think you can represent codepoint 24 in well-formed XML, with or 
without a CDATA.


http://www.w3.org/TR/xquery/#doc-xquery-CDataSectionContents defines the 
CDATA section as containing Char, and refers to NT-Char from XML.

[108]   CDataSectionContents   ::=   (Char* - (Char* ']]>' Char*))

[157]           Char       ::=          [http://www.w3.org/TR/REC-xml#NT-Char]

http://www.w3.org/TR/REC-xml/#NT-Char

[2]     Char   ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
        /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Codepoint 24 (0x18) is less than 0x20, and is not 0x9, 0xA, or 0xD. So 
as I read the spec, codepoint 24 (0x18) isn't allowed in CDATA. If 
that's correct, then it may be a bug that the server allowed your test 
case to run at all.
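
The Char production quoted above can be written as a small range check. This is a minimal Python sketch, not anything MarkLogic ships: the names is_xml_char and first_invalid_char are made up for illustration, and the ranges are copied from production [2] of the XML 1.0 recommendation.

```python
def is_xml_char(codepoint: int) -> bool:
    """Return True if codepoint matches production [2] Char of XML 1.0."""
    return (
        codepoint in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= codepoint <= 0xD7FF        # excludes the other C0 controls
        or 0xE000 <= codepoint <= 0xFFFD      # skips surrogates, FFFE, FFFF
        or 0x10000 <= codepoint <= 0x10FFFF   # supplementary planes
    )


def first_invalid_char(text: str):
    """Return (offset, codepoint) of the first non-Char character, or None."""
    for offset, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return offset, ord(ch)
    return None


if __name__ == "__main__":
    # Hypothetical sample: the test string with codepoint 24 (0x18) embedded.
    sample = "This is a test \x18 if not escaped, this should not work"
    print(is_xml_char(24))             # codepoint 24 fails the range check
    print(first_invalid_char(sample))  # offset and codepoint of the offender
```

Running a check like this over text before it is inserted would flag the offending character regardless of whether it sits inside a CDATA section, since CDataSectionContents is built from the same Char production.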

To check this with another tool, I placed a test XML doc in a file and 
tried xmllint:

$ xmllint /tmp/cp24
/tmp/cp24:1: parser error : CData section not finished
This is a tes
<doc><![CDATA[This is a test ▒ if not escaped, this should not work]]></doc>
                              ^
/tmp/cp24:1: parser error : PCDATA invalid Char value 24
...

So xmllint seems to agree with my interpretation of the W3C recommendations.
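
If xmllint isn't handy, the same experiment can be repeated with another parser. Here is a rough sketch using Python's standard xml.etree.ElementTree module (which wraps expat); the helper name is_well_formed and the three sample strings are illustrative assumptions, modeled on the test case in the original message.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the parser accepts xml_text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Raw codepoint 24 (0x18) inside a CDATA section, as in the test document.
RAW_IN_CDATA = (
    "<doc><![CDATA[This is a test \x18 if not escaped, "
    "this should not work]]></doc>"
)
# The same codepoint written as a numeric character reference.
AS_CHAR_REF = "<doc>This is a test&#24; if not escaped, this should not work</doc>"
# Control case: the same document without the offending character.
CLEAN = "<doc><![CDATA[This is a test if not escaped, this should not work]]></doc>"

if __name__ == "__main__":
    for label, doc in [
        ("raw 0x18 in CDATA", RAW_IN_CDATA),
        ("&#24; char reference", AS_CHAR_REF),
        ("no control character", CLEAN),
    ]:
        print(label, "->", "accepted" if is_well_formed(doc) else "rejected")
```

Only the control case should be accepted: neither the raw character inside CDATA nor the &#24; reference is well-formed, which matches the xmllint result.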

For more discussion of CDATA and MarkLogic Server, you might find 
http://marklogic.markmail.org/search/?q=cdata interesting.

-- Mike

On 2010-07-15 10:32, cashatzer-markm...@yahoo.com wrote:
> We are having a problem where Marklogic appears to be removing the CDATA 
> sections that we have wrapped our text elements with.   This is causing 
> Marklogic replication to fail for documents with content that needs to be 
> escaped; the document is saved successfully in the "source" database, but 
> when it goes to replicate a document containing content that must be escaped 
> to a "target" database, it fails.
>
> Below is an example of what I am referring to.  NOTE that there is SUPPOSED 
> to be a non-printable character between "test" and "if" in the text below.   
> This non-printable character is being converted to &#24; by Marklogic.    When 
> Marklogic replication tries to send that text over to the target, it fails 
> with a similar error (XDMP-DOCCHARREF) to the error shown below.   I can send 
> a text file containing this text with the non-printable character if needed 
> to duplicate this problem.
>
> Why is MarkLogic stripping the CDATA sections?  It should not do this.   
> Applications should not have to reprocess their documents to put CDATA 
> sections around text that was already wrapped previously.
>
>
> EXAMPLE:
>
> cqsh>  xdmp:document-insert("testcdata.xml",
>     ->  <doc><![CDATA[This is a test   if not escaped, this should not 
> work]]></doc>);
> Done (0.04 sec)
>
> cqsh>  for $i in //doc return $i;
> <doc>This is a test&#24;  if not escaped, this should not work</doc>
> Done (0.01 sec)
> cqsh>  xdmp:document-insert("testcdata.xml",
>     ->  <doc>This is a test&#24;  if not escaped, this should not work</doc>);
> ---------------------------
>        XQuery Error
> ---------------------------
> Message: XDMP-CHARREF: (err:XPST0003) Invalid character reference "24  if not 
> escaped, this should not work"
> --STACK DUMP--
> line number:  1
> context item:  null
> context position:  0
> uri:  /eval
> variable bindings:
>


