Thanks, Erik, but I'm developing this system from scratch as it has
specific use cases including dealing with multiple languages including
multiple forms of a specific minority language (Irish).

I'm going to look at XTF anyway just to see how they managed it!

Thanks,

A.

> Have you looked at XTF?   <http://www.cdlib.org/inside/projects/xtf/>
>
> It does what you're after and much,much more.
>
>       Erik
>
>
> On Aug 13, 2008, at 4:03 AM, [EMAIL PROTECTED] wrote:
>
>> Dear users,
>>
>> Question on approaches to indexing TEI XML or similar section/
>> subsectioned
>> files.
>>
>> I'm indexing TEI P4 XML files using Lucene 2.x.
>>
>> Currently, each TEI XML file corresponds to a Lucene document.
>> I extract the data from each XML file using XPath expressions e.g.
>> for the
>> body text: "/TEI.2/text//p". I also extract and store various meta
>> data
>> e.g. author, title, publishing data etc. per document.
>>
>> The issue is that TEI documents can be very large and contain several
>> chapters. Ideally, search terms would return references to chapter(s)
>> in which the terms were found. The user would then follow a
>> hyperlink to a
>> particular subsection rather than retrieving the entire file.
>>
>> I think it is possible to transform TEI files into chapterised
>> sections
>> using XSLT although I have not managed this yet. The final system
>> is likely to use Apache Cocoon to present documents in various
>> formats but
>> that is a separate issue.
>>
>> I'm tending towards a solution involving indexing each section as a
>> document (possibly with only the front-matter being associated with
>> the
>> meta data e.g. title) and then maybe using XPointer to associate the
>> source document.
>>
>> Any comments/approaches taken to similar issues appreciated.
>>
>> Thanks,
>>
>> Aodh Ó Lionáird.
>>
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to