Re: [MarkLogic Dev General] Problems reading ISO-8859-1 character í

Tim Meagher Mon, 11 Jul 2011 18:29:57 -0700

Hi Danny,


I tried setting the repair option to none.  The error message was more
descriptive.  It was:

 

<error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd"
xmlns:error="http://marklogic.com/xdmp/error"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <error:code>XDMP-DOCENTITYREF</error:code>
  <error:name/>
  <error:xquery-version>1.0-ml</error:xquery-version>
  <error:message>Invalid entity reference</error:message>
  <error:format-string>XDMP-DOCENTITYREF: xdmp:document-get("http://......",
&lt;options xmlns="xdmp:document-get"&gt;&lt;repair&gt;none&lt;/repair&gt;&lt;encoding&gt;iso-8859-1&lt;/encoding&gt;&lt;/options&gt;)
-- Invalid entity reference "atilde" at http://....xml line 31</error:format-string>
  <error:retryable>false</error:retryable>
  <error:expr>xdmp:document-get("http://......",
&lt;options xmlns="xdmp:document-get"&gt;&lt;repair&gt;none&lt;/repair&gt;&lt;encoding&gt;iso-8859-1&lt;/encoding&gt;&lt;/options&gt;)</error:expr>
  <error:data>
    <error:datum>"atilde"</error:datum>

 

Looks like it first complained about the atilde entity reference.  I don't
understand why that is a problem.
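
A possible explanation, offered as a sketch rather than a statement about
MarkLogic internals: XML itself predefines only five named entities (amp, lt,
gt, quot, apos). Names such as atilde and iacute come from the HTML entity
sets and are legal in an XML document only when a DTD declares them, so a
strict parser may reject them. The Python below uses only the standard
library, not MarkLogic, and the helper names html_entities_to_numeric and
parses are made up for illustration. It reproduces the same kind of failure
and shows one workaround: rewriting named entities as numeric character
references before parsing.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The only named entities XML defines without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def html_entities_to_numeric(text):
    """Rewrite HTML named entities (e.g. &atilde;) as numeric references.

    XML's predefined entities and unknown names are left untouched.
    """
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


def parses(text):
    """Return True if the text is accepted by a strict XML parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample; no DTD declares atilde or iacute here.
sample = "<p>S&atilde;o Paulo, Medell&iacute;n &amp; more</p>"
fixed = html_entities_to_numeric(sample)

print(parses(sample))  # the undeclared entities are rejected
print(fixed)
print(ET.fromstring(fixed).text)
```

If the source documents cannot be changed, a preprocessing step along these
lines (fetch the text, rewrite the entities, then parse) is one option;
whether it fits a MarkLogic pipeline is an assumption to verify.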

 

BTW, I've been using the full repair option as occasionally I'll get a
document that is not well-formed.  Most of the docs I obtain are utf-8, but
occasionally an ISO-8859-1 doc arrives. I'm just trying to open the sieve
pretty wide.

 

Are the xdmp:tidy options the same in both 4.1 and 4.2?

 

Thank you!

 

Tim

From: [email protected]
[mailto:[email protected]] On Behalf Of Danny Sokolsky
Sent: Monday, July 11, 2011 8:09 PM
To: General MarkLogic Developer Discussion
Subject: Re: [MarkLogic Dev General] Problems reading ISO-8859-1 character
&iacute; 

 

I would set repair to none instead of full.  Repair is really not made for
that kind of cleanup.

 

If you need to clean up the content, use xdmp:tidy instead.

 

The last example in the tidy doc shows how to do that:

 

http://docs.marklogic.com/4.2doc/docapp.xqy#search.xqy?start=1&cat=all&query=xdmp:tidy&button=search

 

-Danny

 

From: [email protected]
[mailto:[email protected]] On Behalf Of Tim Meagher
Sent: Monday, July 11, 2011 4:53 PM
To: 'General MarkLogic Developer Discussion'
Subject: Re: [MarkLogic Dev General] Problems reading ISO-8859-1 character
&iacute; 

 

Upon further review there appears to be some kind of flaky issue going on.
I can delete parts of the XML document that contain no character entities
and xdmp:document-get() recognizes the &iacute; without error.  I cannot
pinpoint the problem even though I'm using a hex editor to try to identify
problematic characters.  Has anyone experienced anything like this?

 

Thx again,

 

Tim

 

From: [email protected]
[mailto:[email protected]] On Behalf Of Tim Meagher
Sent: Monday, July 11, 2011 6:42 PM
To: 'General MarkLogic Developer Discussion'
Subject: [MarkLogic Dev General] Problems reading ISO-8859-1 character
&iacute; 

 

Hi Folks,

 

I am trying to use MarkLogic to read an XML file from a web page using
xdmp:document-get().  The document is ISO-8859-1 encoded, so my invocation
looks like this:

 

let $url :=
    "http://blah/blah/blah/doc.xml"
let $options :=
    <options xmlns="xdmp:document-get">
        <repair>full</repair>
        <encoding>iso-8859-1</encoding>
    </options>

let $err-message := ""
let $error := false()

let $node :=
    try {
        xdmp:document-get($url, $options)
    }
    catch($e) {(
        xdmp:set($err-message, $e),
        xdmp:set($error, true()),
        xdmp:log(concat("Error getting ", $url, ": ", xdmp:quote($e)))
    )}

return
    if ($error) then $err-message
    else $node

 

The following error is returned:

 

<error:error xsi:schemaLocation="http://marklogic.com/xdmp/error error.xsd"
xmlns:error="http://marklogic.com/xdmp/error"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <error:code>XDMP-DOCUNEOF</error:code>
  <error:name/>
  <error:xquery-version>1.0-ml</error:xquery-version>
  <error:message>XDMP-DOCUNEOF</error:message>
  <error:format-string/>
  <error:retryable>false</error:retryable>

 

I have traced the problem to the use of the ISO-8859-1 character entity
&iacute; and I get the error even if I replace it with its numeric
equivalent &#237;. Removing the entity causes the document to be read
without error, even though another ISO-8859-1 character entity, &atilde;,
is handled without error.

 

I'm using MarkLogic 4.1-7.1.

 

Can anyone tell me what's up with this?  From what I can tell &iacute; is a
valid ISO-8859-1 character entity.

Thank you!

 

Tim Meagher

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
