Re: [MarkLogic Dev General] How to handle named HTML character entities when loading an ISO-8859-1 encoded document into MarkLogic?

Tim Meagher Mon, 05 Jul 2010 03:58:48 -0700

Hi Geert,


Interesting.  I checked into the document and noticed that it references a
DTD that references entities defined in files separate from the DTD.

 

Thanks,

 

Tim

 

-----Original Message-----
From: general-boun...@developer.marklogic.com
[mailto:general-boun...@developer.marklogic.com] On Behalf Of Geert Josten
Sent: Monday, July 05, 2010 6:43 AM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] How to handle named HTML character
entities when loading an ISO-8859-1 encoded document into MarkLogic?

 

Hi Tim,

 

To my knowledge, MarkLogic Server accepts only the five predefined XML named
entities (lt, gt, amp, apos, quot) by default, plus any other named entities
declared in the document's local (internal) declaration subset. External
declarations are ignored. Off the top of my head, the local declaration should
look something like the following; add it directly after the XML declaration:

 

<!DOCTYPE {name_of_root} PUBLIC "some_pub_id" "some_system_id" [

<!ENTITY sim "&#x223C;">

]>
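
As a rough illustration of the local-declaration approach above, the following Python sketch scans a document for named entity references and inserts a DOCTYPE with matching local declarations right after the XML declaration. It is only a sketch under stated assumptions: the function name add_internal_subset is hypothetical, the entity table comes from Python's html.entities module (HTML 4 names only), and the input is assumed not to carry a DOCTYPE of its own.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself; they need no declaration.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")


def add_internal_subset(xml_text: str, root_name: str) -> str:
    """Insert a DOCTYPE that locally declares the HTML entities used in xml_text.

    Assumption: xml_text has no DOCTYPE yet. A document that already has one
    would need its existing declaration merged instead.
    """
    used = sorted(
        name
        for name in set(ENTITY_REF.findall(xml_text))
        if name in name2codepoint and name not in XML_PREDEFINED
    )
    if not used:
        return xml_text
    decls = "\n".join(
        f'<!ENTITY {name} "&#x{name2codepoint[name]:X};">' for name in used
    )
    doctype = f"<!DOCTYPE {root_name} [\n{decls}\n]>\n"
    match = XML_DECL.match(xml_text)
    if match:
        # Keep the XML declaration first; the DOCTYPE goes directly after it.
        return match.group(0) + "\n" + doctype + xml_text[match.end():].lstrip()
    return doctype + xml_text
```

Calling add_internal_subset(text, "doc") on a document that uses &sim; would add a single declaration for sim and leave &amp; and the other predefined entities alone.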

 

It might be easier, though, to put a proxy service in between (if possible)
that normalizes the encoding and resolves these entities (which usually
only requires parsing the XML with a DTD declaration).
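
A minimal sketch of such a normalizing step, in Python, might look like the code below. Note that it swaps in a simpler technique than the DTD-aware parse mentioned above: it decodes the bytes, rewrites named entity references as numeric character references with a regular expression, and re-encodes as UTF-8. The function name normalize_for_load is hypothetical, the entity table is Python's html.entities (HTML 4 names only), and the regex approach would also rewrite references inside CDATA sections or comments, so treat it as a starting point rather than a complete solution.

```python
import re
from html.entities import name2codepoint

# The five entities predefined by XML itself; leave these untouched.
XML_PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# Matches the encoding pseudo-attribute inside the XML declaration.
ENCODING_DECL = re.compile(r"(<\?xml[^>]*?encoding=)([\"'])[^\"']*\2")


def normalize_for_load(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Return UTF-8 XML bytes with HTML named entities made numeric."""
    text = raw.decode(source_encoding)

    def to_numeric(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            # Unknown names are left as-is so the loader can still report them.
            return match.group(0)
        return f"&#x{name2codepoint[name]:X};"

    text = ENTITY_REF.sub(to_numeric, text)
    # The output is UTF-8, so the XML declaration has to say so as well.
    text = ENCODING_DECL.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")
```

The result could then be loaded as ordinary UTF-8 content, with no entity declarations needed on the MarkLogic side.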

 

Kind regards,

Geert

 


drs. G.P.H. (Geert) Josten

Consultant

 

Daidalos BV

Hoekeindsehof 1-4

2665 JZ Bleiswijk

 

T +31 (0)10 850 1200

F +31 (0)10 850 1199

 

mailto:geert.jos...@daidalos.nl

http://www.daidalos.nl/

 

KvK 27164984

 

 

The information sent in or with this e-mail message originates from
Daidalos BV and is intended solely for the addressee. If you have received
this message unintentionally, we ask that you delete it. No rights can be
derived from this message.

 

> From: general-boun...@developer.marklogic.com
> [mailto:general-boun...@developer.marklogic.com] On Behalf Of
> Tim Meagher
> Sent: Monday, 5 July 2010 12:21
> To: 'General Mark Logic Developer Discussion'
> Subject: [MarkLogic Dev General] How to handle named
> HTML character entities when loading an ISO-8859-1 encoded
> document into MarkLogic?

>
> Hi Folks,
>
> I am using xdmp:document-load to insert content into MarkLogic.  Until
> recently I had only been loading UTF-8 XML into the database, but I
> recently started encountering some ISO-8859-1 encoded content.  I was able
> to adjust the xdmp:document-load options to accommodate ISO-8859-1, and for
> the most part it has been working okay; however, the ISO-8859-1 content
> occasionally includes HTML character entities such as &sim;, which appear
> to cause the load to fail.  The failure produces an XDMP-DOCUNEOF error
> message when the error is not trapped with a try-catch block, but an
> XDMP-DOCENTITYREF error message when it is trapped.
>
> Is there a simple way to add a list of character entity mappings to get
> this to work?  For example, I've read that &sim; maps to the Unicode
> character U+223C
> <http://www.fileformat.info/info/unicode/char/223c/index.htm>
> (http://code.google.com/p/doctype/wiki/SimCharacterEntity).
>
> Thanks ahead of time for any help with this!
>
> Tim Meagher
>

_______________________________________________

General mailing list

General@developer.marklogic.com

http://developer.marklogic.com/mailman/listinfo/general
