I have a number of XML documents (sample below) that are not UTF-8 encoded and that reference an external DTD and a rendering stylesheet. What's the best way to get these documents into MarkLogic so that:

- the encoding is changed to UTF-8
- any entities in the DTD are resolved to UTF-8 encoded characters
- any CDATA sections are removed with the content left intact, including markup embedded in the CDATA content

Do I need to pre-process the files before loading, or can MarkLogic handle these kinds of conversions as part of the load functions?
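
If pre-processing turns out to be necessary, a pass along the following lines is roughly what I have in mind. This is only a sketch in Python, not anything MarkLogic provides. The function name `to_utf8` is made up, the entity table is Python's HTML 4 list from `html.entities` (an assumption; the real DTD may declare other names), and CDATA sections are unwrapped with a simple regular expression, which assumes the markup embedded in them is itself well formed.

```python
import re
from html.entities import name2codepoint

# Matches a CDATA section and captures its content.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)
# Matches a named entity reference such as &nbsp; or &ensp;.
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# Matches the encoding pseudo-attribute of the XML declaration.
DECL_ENCODING_RE = re.compile(r"""(<\?xml\s[^>]*?encoding\s*=\s*)(["'])[^"']*\2""")
# These five must stay escaped, or the output stops being well formed.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_utf8(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Hypothetical helper: re-encode, unwrap CDATA, resolve named entities."""
    text = raw.decode(source_encoding)

    # Point the XML declaration at the new encoding.
    text = DECL_ENCODING_RE.sub(r"\g<1>\g<2>UTF-8\g<2>", text, count=1)

    # Drop the CDATA wrappers but keep their content, embedded markup included.
    text = CDATA_RE.sub(lambda m: m.group(1), text)

    def resolve(match: re.Match) -> str:
        name = match.group(1)
        # Leave predefined XML entities and unknown names untouched.
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return chr(name2codepoint[name])

    text = ENTITY_RE.sub(resolve, text)
    return text.encode("utf-8")
```

Entity names that are not in the table are left as they are, so they can be spotted and handled separately.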

Also, does anyone know of any good strategies for converting math in TeX format to MathML?

Thanks,

Alan

Alan Darnell
University of Toronto



<?xml version="1.0" encoding="iso-8859-1"?>
<?xml-stylesheet type="text/xsl" href="file://batchgate1\StyleS\bpg40.xsl"?>
<!DOCTYPE content PUBLIC "-//BLACKWELL PUBLISHING GROUP//DTD 4.0//EN" "\\Batchgate1\bpgdtd\4-0\bpg4-0.dtd">
<content dtdver="4.0" docfmt="xml">
        <publisherinfo>
                <publisher>Blackwell Publishing Ltd</publisher>
                <address format="inline">Oxford, UK</address>
        </publisherinfo>
        <contentinfo type="journal" language="en">
                <contentcode>JTH</contentcode>
                <titlegroup>
<title type="journal">Journal of Thrombosis and Haemostasis</title>
                </titlegroup>
                <issn>1538-7933</issn>
<copyright>2006 International Society on Thrombosis and Haemostasis</copyright>
        </contentinfo>
<document type="primary_article" sequence="1" referencetype="vancouver">
                <header>
                        <documentinfo language="en">
                                <idgroup>
<documentid type="doi" id="10.1111/j.1538-7836.2006.02000.x" status="live" /> <documentid type="bpl" id="2000" status="live" /> <documentid type="version" id="fi" status="live" />
                                </idgroup>
                                <relatedgroup>
<related relationship="child" type="object" /> <related relationship="self" type="primary_article"> <file name="jth_2000.xml" type="xml" />
                                        </related>
<related relationship="sibling" type="pages"> <file name="jth_2000.pdf" type="pdf" />
                                        </related>
                                </relatedgroup>
                                <date date="2006-08">August 2006</date>
                                <pagedetails>
                                        <volume>4</volume>
                                        <issue sequence="15">8</issue>
                                        <page type="first">1747</page>
                                        <page type="last">1755</page>
                                </pagedetails>
                                <countgroup>
<count type="figure_total" count="6" /> <count type="table_total" count="3" /> <count type="page_total" count="9" />
                                </countgroup>
                                <trackinghistory>
<trackingdate type="created" date="2006-04-25" /> <trackingdate type="markedup" date="0000" by="SPS" software="preediting tool" version="4.0" /> <trackingdate type="paginated" date="0000" by="FSS_SPS" /> <trackingdate type="received" date="0000" /> <trackingdate type="revised" date="0000"/> <trackingdate type="accepted" date="0000" /> <trackingdate type="Delivered as FI" date="20060825" />
                                </trackinghistory>
<tocheading level="1">ORIGINAL ARTICLES</tocheading> <tocheading level="2"><i>Coagulation</i>
                                </tocheading>
                                <runningheadgroup>
<runninghead type="title"><i>Tissue factor antigen in plasma</i>
                                        </runninghead>
<runninghead type="author"><i>B. Parhami-Seren</i> et&nbsp;al
                                        </runninghead>
                                </runningheadgroup>
                        </documentinfo>
                        <history>
<p>Received 23 February 2006, accepted 12 April 2006</p>
                        </history>
                        <footnotegroup>
<correspondent id="c1">Behnaz Parhami-Seren, Department of Biochemistry, College of Medicine, University of Vermont, 208 South Park Drive, Cholchester, VT 05446-0068, USA.<br />Tel.:+1&nbsp;802&nbsp;656&nbsp;3286; fax: +1&nbsp;802&nbsp;656&nbsp;2256; e-mail: <externallink type="email">[EMAIL PROTECTED]</externallink>
                                </correspondent>
                        </footnotegroup>
                        <titlegroup>
<title type="surtitle">ORIGINAL ARTICLE</title> <title type="document">Immunologic quantitation of tissue factors</title>
                        </titlegroup>
                        <namegroup type="author">
                                <name type="author">
<forenames>B.</forenames><x> </x> <surname>PARHAMI-SEREN</surname>
                                </name><x>, </x>
                                <name type="author">
<forenames>S.</forenames><x> </x>
                                        <surname>BUTENAS</surname>
                                </name><x>, </x>
                                <name type="author">
<forenames>J.</forenames><x> </x> <surname>KRUDYSZ-AMBLO</surname>
                                </name><x> and </x>
                                <name type="author">
<forenames>K. G.</forenames><x> </x>
                                        <surname>MANN</surname>
                                </name>
<address format="inline">Department of Biochemistry, College of Medicine, University of Vermont, Burlington, VT, USA</address>
                        </namegroup>
                        <summary language="en">
<heading implicit="yes" id="h1" level="5" format="inline">Summary.&ensp;</heading> <p>The large number of conflicting reports on the presence and concentration of circulating tissue factor (TF) in blood generates uncertainties regarding its relevance to hemostasis and association with specific diseases. We believe that the source of these controversies lies in part in the assays used for TF quantitation. We have developed a highly sensitive and sp
...
                                </p>
                        </summary>
                        <keywordgroup language="en" format="display">
<heading implicit="yes" id="h2" level="5" format="inline">Keywords:&ensp;</heading> <keyword>fluorescence immunoassay</keyword><x>,</x>
                                <keyword>placenta</keyword><x>, </x>
                                <keyword>plasma</keyword><x>, </x>
<keyword>tissue factor</keyword><x>.</x>
                        </keywordgroup>
                </header>
        </document>
</content>

