> The idea is to create a full text index of the alto content, accompanied by 
> the author/title info from the mets file for purposes of results display.

- Then you need to list only the alto files in your landscapes entity
(fileName="^ID.{3}-ALTO\d{3}\.xml$" or something like that), because
you don't want to index every mets file as a separate solr document,
right?

- Also, it seems you might want to add a regex transformer field that
extracts the ID from the alto file name. sourceColName takes a column
name rather than a ${...} variable, so point it at the fileAbsPath
template column you already define:
   <field column="metsId" regex="ID(.{3})-ALTO\d{3}\.xml"
sourceColName="fileAbsPath"/>

- And finally, add a nested entity that processes the mets file for every
alto record. Since fileAbsolutePath already ends in the alto file name,
build the url from the directory (fileDir) instead:

<entity name="landscapes" ...>
  <entity name="sample" ...>
    <entity name="metsProcessor"
url="${landscapes.fileDir}/../ID${sample.metsId}-mets.xml"
processor="XPathEntityProcessor" forEach="/mets"
transformer="TemplateTransformer,RegexTransformer,LogTransformer">

and extract the mets elements/attributes and index them as separate fields.
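
Putting the three steps together, a data-config.xml might look roughly like
the sketch below. It is untested and rests on a few assumptions: that file
names really follow the IDxxx-ALTOnnn.xml pattern from your description (the
real IDs may be longer, in which case the regexes need adjusting), that
${landscapes.fileDir} resolves to the ALTO/ directory holding the current
file, and that the transformed metsId column is visible to the child entity
as ${sample.metsId}. The field xpaths are copied from your config.

```xml
<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <!-- Step 1: enumerate only the per-page ALTO files, not the METS files. -->
    <entity name="landscapes" rootEntity="false"
            processor="FileListEntityProcessor"
            fileName="-ALTO\d{3}\.xml$" recursive="true"
            baseDir="/home/utlol/htdocs/lib-landscapes-new/publications/">

      <!-- One Solr document per ALTO page. -->
      <entity name="sample" rootEntity="true" stream="true" pk="filename"
              url="${landscapes.fileAbsolutePath}"
              processor="XPathEntityProcessor" forEach="/alto"
              transformer="TemplateTransformer,RegexTransformer,LogTransformer"
              logTemplate=" processing ${landscapes.fileAbsolutePath}"
              logLevel="info">

        <field column="fileAbsPath" template="${landscapes.fileAbsolutePath}" />
        <!-- Step 2: capture the ID from the path (assumes IDxxx-ALTOnnn.xml). -->
        <field column="metsId" sourceColName="fileAbsPath"
               regex="ID(.{3})-ALTO\d{3}\.xml$" />
        <field column="filename"
               xpath="/alto/Description/sourceImageInformation/fileName" />
        <field column="content"
               xpath="/alto/Layout/Page/PrintSpace/TextBlock/TextLine/String/@CONTENT" />

        <!-- Step 3: look up the parent METS file, one directory above ALTO/. -->
        <entity name="metsProcessor"
                url="${landscapes.fileDir}/../ID${sample.metsId}-mets.xml"
                processor="XPathEntityProcessor" forEach="/mets">
          <field column="title"
                 xpath="/mets/dmdSec/mdWrap/xmlData/mods/titleInfo/title" />
        </entity>
      </entity>
    </entity>
  </document>
</dataConfig>
```

The mets file gets re-parsed for every page this way, which is wasteful but
keeps the config simple. If that turns out to be too slow, it may be worth
caching the mets lookup or indexing in two passes.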

P.S. I haven't tried a similar scenario, so I'm just speculating.

On Fri, Nov 19, 2010 at 12:09 AM, Fred Gilmore <fgilm...@mail.utexas.edu> wrote:
> mets/alto is an xml standard for describing physical objects.  In this case,
> we're describing books.  The mets file holds the metadata (author, title,
> etc.), the alto file is the physical description (words on the page,
> formatting of the page).  So it's a one (mets) to many (alto) relationship.
>
> the directory structure:
>
> /our/collection/IDxxx/:
>
> IDxxx-mets.xml
> ALTO/
>
> /our/collection/IDxxx/ALTO/:
>
> IDxxx-ALTO001.xml
> IDxxx-ALTO002.xml
>
> ie. an xml file per scanned book page.
>
> Beyond the ID number as part of the file names, the mets file contains no
> reference to the alto children.  The alto children do contain a reference to
> the jpg page scan, which is labelled with the ID number as part of the name.
>
> The idea is to create a full text index of the alto content, accompanied by
> the author/title info from the mets file for purposes of results display.
>  The first try with this is attempting a recursive FileDataSource approach.
>
> It was relatively easy to create a "content" field which holds the text of
> the page (each word is actually an attribute of a separate tag), but I'm
> having difficulty determining how I'm going to conditionally add the author
> and title data from the METS file to the rows created with the ALTO content
> field.  It'll involve regex'ing out the ID number associated with both the
> mets and alto filenames for starters, but even at that, I don't see how to
> keep it straight since it's not one mets=one alto and it's also not a static
> string for the entire index.
>
> thanks for any hints you can provide.
>
> Fred
> University of Texas at Austin
> ==========================================
> data-config.xml thus far:
>
> <dataConfig>
> <dataSource type="FileDataSource" />
> <document>
> <entity name="landscapes" rootEntity="false"
> processor="FileListEntityProcessor" fileName=".xml$" recursive="true"
> baseDir="/home/utlol/htdocs/lib-landscapes-new/publications/">
> <entity name="sample" rootEntity="true"
> stream="true"
> pk="filename"
> url="${landscapes.fileAbsolutePath}"
> processor="XPathEntityProcessor"
> forEach="/mets | /alto"
> transformer="TemplateTransformer,RegexTransformer,LogTransformer"
> logTemplate=" processing ${landscapes.fileAbsolutePath}"
> logLevel="info"
>>
>
> <!-- use system filename for getting OCLC number -->
> <!-- we need it both for linking to results and for referencing the METS
> file -->
> <field column="fileAbsPath"     template="${landscapes.fileAbsolutePath}" />
>
>
> <field column="title"
> xpath="/mets/dmdSec/mdWrap/xmlData/mods/titleInfo/title" />
> <!--
> <field column="author"
>  xpath="/mets/dmdSec/mdWrap/xmlData/mods/na...@id='MODSMD_PRINT_N1']/namepa...@type='given']"
> />
> -->
> <field column="filename"
> xpath="/alto/Description/sourceImageInformation/fileName" />
> <field column="content"
> xpath="/alto/Layout/Page/PrintSpace/TextBlock/TextLine/String/@CONTENT" />
> </entity>
> </entity>
> </document>
> </dataConfig>
> ==============================================
> METS example:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <mets xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xmlns="http://www.loc.gov/METS/"
> xsi:schemaLocation="http://www.loc.gov/METS/
> http://schema.ccs-gmbh.com/docworks/version20/mets-docworks.xsd"
> xmlns:MODS="http://www.loc.gov/mods/v3" xmlns:mix="http://www.loc.gov/mix/"
> xmlns:xlink="http://www.w3.org/1999/xlink" TYPE="METAe_Monograph"
> LABEL="ENVIRONMENTAL GEOLOGIC ATLAS OF THE TEXAS COASTAL ZONE- Kingsville
> Area">
> <metsHdr CREATEDATE="2010-05-06T11:21:18" LASTMODDATE="2010-05-06T11:21:18">
> <agent ROLE="CREATOR" TYPE="OTHER" OTHERTYPE="SOFTWARE">
> <name>CCS docWORKS/METAe Version 6.3-0</name>
> <note>docWORKS-ID: 1677</note>
> </agent>
> </metsHdr>
> <dmdSec ID="MODSMD_PRINT">
> <mdWrap MIMETYPE="text/xml" MDTYPE="MODS" LABEL="Bibliographic meta-data of
> the printed version">
> <xmlData>
> <MODS:mods>
> <MODS:titleInfo ID="MODSMD_PRINT_TI1" xml:lang="en">
> <MODS:title>ENVIRONMENTAL GEOLOGIC ATLAS OF THE TEXAS COASTAL ZONE-
> Kingsville Area</MODS:title>
> </MODS:titleInfo>
> <MODS:name ID="MODSMD_PRINT_N1" type="personal">
> <MODS:namePart type="given">L F. Brown, Jr., J. H. McGowen, T. J. Evans, C.
> G.</MODS:namePart>
> <MODS:namePart type="family">Groat</MODS:namePart>
> <MODS:role>
> <MODS:roleTerm>aut</MODS:roleTerm>
> </MODS:role>
> </MODS:name>
> <MODS:name ID="MODSMD_PRINT_N2" type="personal">
> <MODS:namePart type="given">W. L.</MODS:namePart>
> <MODS:namePart type="family">Fisher</MODS:namePart>
> <MODS:role>
> <MODS:roleTerm>aut</MODS:roleTerm>
> </MODS:role>
> </MODS:name>
>
> ============================================
> ALTO example:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <alto xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:noNamespaceSchemaLocation="http://schema.ccs-gmbh.com/metae/alto-1-1.xsd"
> xmlns:xlink="http://www.w3.org/TR/xlink">
> <Description>
> <MeasurementUnit>mm10</MeasurementUnit>
> <sourceImageInformation>
> <fileName>/Docworks/IN/GeologyBooks/txu-oclc-6917337/txu-oclc-6917337-009.jpg</fileName>
> </sourceImageInformation>
> <OCRProcessing ID="OCRPROCESSING_1">
> <preProcessingStep>
> <processingSoftware>
> <softwareCreator>CCS Content Conversion Specialists GmbH,
> Germany</softwareCreator>
> <softwareName>CCS docWORKS</softwareName>
> <softwareVersion>6.3-0.93</softwareVersion>
> </processingSoftware>
> </preProcessingStep>
> <ocrProcessingStep>
> <processingSoftware>
> <softwareCreator>ABBYY (BIT Software), Russia</softwareCreator>
> <softwareName>FineReader</softwareName>
> <softwareVersion>7.0</softwareVersion>
> </processingSoftware>
> </ocrProcessingStep>
> </OCRProcessing>
> </Description>
> <Styles>
> <TextStyle ID="TXT_0" FONTSIZE="11" FONTFAMILY="Times New Roman"/>
> <ParagraphStyle ID="PAR_CENTER" ALIGN="Center"/>
> <ParagraphStyle ID="PAR_BLOCK" ALIGN="Block"/>
> <ParagraphStyle ID="PAR_RIGHT" ALIGN="Right"/>
> <ParagraphStyle ID="PAR_LEFT" ALIGN="Left"/>
> </Styles>
> <Layout>
> <Page ID="P9" PHYSICAL_IMG_NR="9" HEIGHT="2855" WIDTH="2258">
> <TopMargin ID="P9_TM00001" HPOS="0" VPOS="0" WIDTH="2258" HEIGHT="196"/>
> <LeftMargin ID="P9_LM00001" HPOS="0" VPOS="196" WIDTH="151" HEIGHT="2345"/>
> <RightMargin ID="P9_RM00001" HPOS="2104" VPOS="196" WIDTH="154"
> HEIGHT="2345"/>
> <BottomMargin ID="P9_BM00001" HPOS="0" VPOS="2541" WIDTH="2258"
> HEIGHT="314"/>
> <PrintSpace ID="P9_PS00001" HPOS="151" VPOS="196" WIDTH="1953"
> HEIGHT="2345">
> <TextBlock ID="P9_TB00001" HPOS="1045" VPOS="196" WIDTH="173" HEIGHT="28"
> STYLEREFS="TXT_0 PAR_CENTER">
> <TextLine ID="P9_TL00001" HPOS="1045" VPOS="197" WIDTH="173" HEIGHT="27">
> <String ID="P9_ST00001" HPOS="1045" VPOS="197" WIDTH="173" HEIGHT="27"
> CONTENT="Preface" WC="0.98" CC="0000000"/>
> </TextLine>
>
>
>
>
