Thanks, Erik, but I'm developing this system from scratch as it has specific use cases including dealing with multiple languages including multiple forms of a specific minority language (Irish).
I'm going to look at XTF anyway just to see how they managed it! Thanks, A. > Have you looked at XTF? <http://www.cdlib.org/inside/projects/xtf/> > > It does what you're after and much,much more. > > Erik > > > On Aug 13, 2008, at 4:03 AM, [EMAIL PROTECTED] wrote: > >> Dear users, >> >> Question on approaches to indexing TEI XML or similar section/ >> subsectioned >> files. >> >> I'm indexing TEI P4 XML files using Lucene 2.x. >> >> Currently, each TEI XML file corresponds to a Lucene document. >> I extract the data from each XML file using XPath expressions e.g. >> for the >> body text: "/TEI.2/text//p". I also extract and store various meta >> data >> e.g. author, title, publishing data etc. per document. >> >> The issue is that TEI documents can be very large and contain several >> chapters. Ideally, search terms would return references to chapter(s) >> in which the terms were found. The user would then follow a >> hyperlink to a >> particular subsection rather than retrieving the entire file. >> >> I think it is possible to transform TEI files into chapterised >> sections >> using XSLT although I have not managed this yet. The final system >> is likely to use Apache Cocoon to present documents in various >> formats but >> that is a separate issue. >> >> I'm tending towards a solution involving indexing each section as a >> document (possibly with only the front-matter being associated with >> the >> meta data e.g. title) and then maybe using XPointer to associate the >> source document. >> >> Any comments/approaches taken to similar issues appreciated. >> >> Thanks, >> >> Aodh Ó Lionáird. >> >> >> >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]