Dear users,

I have a question about approaches to indexing TEI XML, or similarly
sectioned/subsectioned, files.

I'm indexing TEI P4 XML files using Lucene 2.x.

Currently, each TEI XML file corresponds to a Lucene document.
I extract the data from each XML file using XPath expressions, e.g. for the
body text: "/TEI.2/text//p". I also extract and store various metadata
(e.g. author, title, publication data) per document.
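
For illustration, the extraction step looks roughly like the sketch below. It
uses only the JDK's javax.xml.xpath API; the class and method names
(TeiBodyText, paragraphs) are made up for this example, and the sample input
is assumed to have no DOCTYPE declaration (real TEI P4 files usually declare a
DTD, which the parser may try to resolve).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of every <p> under /TEI.2/text.
final class TeiBodyText {
    static List<String> paragraphs(String teiXml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Same expression as in the post: all paragraphs below <text>.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "/TEI.2/text//p",
                    new InputSource(new StringReader(teiXml)),
                    XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (XPathExpressionException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }
}
```

The returned strings would then be concatenated into the body-text field of
the single Lucene document for that file.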

The issue is that TEI documents can be very large and contain several
chapters. Ideally, a search would return references to the chapter(s)
in which the terms were found. The user would then follow a hyperlink to a
particular subsection rather than retrieving the entire file.

I think it is possible to split TEI files into chapter-sized sections
using XSLT, although I have not managed this yet. The final system
is likely to use Apache Cocoon to present documents in various formats, but
that is a separate issue.
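
A minimal sketch of one way the XSLT split might work, not a tested solution:
a small XSLT 1.0 stylesheet copies out the n-th chapter, driven from Java via
the JDK's javax.xml.transform API. It assumes chapters are marked up as
<div1> elements directly under /TEI.2/text/body, which may not hold for every
TEI file; the class name TeiChapterXslt and the parameter name n are
hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: returns the n-th <div1> of a TEI file as XML text.
final class TeiChapterXslt {
    // Assumption: chapters are <div1> children of /TEI.2/text/body.
    private static final String STYLESHEET =
            "<xsl:stylesheet version='1.0'"
          + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:param name='n' select='1'/>"
          + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
          + "<xsl:template match='/'>"
          + "<xsl:copy-of select='/TEI.2/text/body/div1[position() = number($n)]'/>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";

    static String chapter(String teiXml, int n) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            // Pass the 1-based chapter number to the stylesheet parameter.
            transformer.setParameter("n", Integer.valueOf(n));
            StringWriter out = new StringWriter();
            transformer.transform(
                    new StreamSource(new StringReader(teiXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Wrap the checked exception so callers need no throws clause.
            throw new IllegalStateException("Could not transform TEI input", e);
        }
    }
}
```

Calling TeiChapterXslt.chapter(xml, 2) would return only the second chapter's
markup, which could then be passed on to whatever renders the page.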

I'm leaning towards a solution that indexes each section as a separate
Lucene document (possibly with only the front matter carrying the
metadata, e.g. title) and then perhaps uses XPointer to link each section
back to its source document.
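
A rough sketch of the per-section idea, under several assumptions: chapters
are <div1> elements under /TEI.2/text/body, the title and author live in
teiHeader/fileDesc/titleStmt, and the pointer uses a positional xpointer()
expression. The class name TeiSections and the field names are hypothetical.
Each map stands in for the fields of one Lucene document; the Lucene calls
themselves are left out so the example needs only the JDK. Unlike the
front-matter-only variant mentioned above, this version copies the metadata
into every section.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: one field map per chapter, ready to be indexed.
final class TeiSections {
    static List<Map<String, String>> split(String sourceName, String teiXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(teiXml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // File-level metadata, read once and shared by all sections.
            String title = xpath.evaluate(
                    "/TEI.2/teiHeader/fileDesc/titleStmt/title", doc).trim();
            String author = xpath.evaluate(
                    "/TEI.2/teiHeader/fileDesc/titleStmt/author", doc).trim();

            // Assumption: each chapter is a <div1> directly under <body>.
            NodeList chapters = (NodeList) xpath.evaluate(
                    "/TEI.2/text/body/div1", doc, XPathConstants.NODESET);

            List<Map<String, String>> sections = new ArrayList<>();
            for (int i = 0; i < chapters.getLength(); i++) {
                Node chapter = chapters.item(i);
                NodeList paras = (NodeList) xpath.evaluate(
                        ".//p", chapter, XPathConstants.NODESET);
                StringBuilder text = new StringBuilder();
                for (int j = 0; j < paras.getLength(); j++) {
                    if (text.length() > 0) {
                        text.append(' ');
                    }
                    text.append(paras.item(j).getTextContent().trim());
                }

                Map<String, String> fields = new LinkedHashMap<>();
                fields.put("title", title);
                fields.put("author", author);
                fields.put("heading", xpath.evaluate("head", chapter).trim());
                fields.put("text", text.toString());
                // Positional pointer back into the source file (1-based).
                fields.put("pointer", sourceName
                        + "#xpointer(/TEI.2/text/body/div1[" + (i + 1) + "])");
                sections.add(fields);
            }
            return sections;
        } catch (Exception e) {
            // Parsing and XPath errors are wrapped for brevity in this sketch.
            throw new IllegalStateException("Could not split TEI input", e);
        }
    }
}
```

A search hit on one of these section documents would then carry its own
pointer value, which the front end could turn into the hyperlink for that
chapter.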

Any comments, or approaches taken to similar issues, would be appreciated.

Thanks,

Aodh Ó Lionáird.




