Hi Leonid, what kind of queries does your use case require?

Complex scenario: You need all the hierarchical structure information in one query. This means you want to search with XPath in a real XML database (for example: all documents with a subtitle XY that is directly followed by a table with the same column as ...).

Normal scenario: You want to search only one part of your hierarchical information at a time, like "documents with word xy in the title" and "documents with word xy in a table".

I am not familiar with the use of Lucene in XML databases, but I can give advice for the "normal scenario": take a look at the XML-aware search in XTF ( http://xtf.wiki.sourceforge.net/tagRef_textIndexer_PreFilter#toctagRef_textIndexer_PreFilter7 ). The idea is to use one Lucene document for each section, with only two fields: "text" and "sectionType". All hits belonging to one hierarchical unit (e.g. one HTML file) are then collected and compressed into one representative hit in Lucene.

Best regards
Karsten

leonardinius wrote:
>
> Any comments, suggestions? Maybe I should rephrase my original message or
> describe it in detail?
> I really would like to get any response if possible.
>
> Thanks a lot in advance!
>
> On Mon, Sep 1, 2008 at 10:25 AM, Leonid Maslov <[EMAIL PROTECTED]> wrote:
>
>> Hi all,
>>
>> First of all, sorry for my poor English. It's not my native language.
>>
>> I'm trying to use Lucene to index hierarchical kind of information: I have
>> structured html and pdf/word documents and I want to index them in ways to
>> perform search in titles, text, paragraphs or tables only, or any
>> combinations of items mentioned above. At the moment I see 3 possible
>> solutions:
>>
>>    - Create the set of all possible fields, like: contents, title,
>>    heading, table etc... And index the data in all them accordingly.
>>    Possible impacts:
>>       - a big count of fields
>>       - data duplication (because I need to make search looking in the
>>       paragraphs to look inside all the inner elements, so every outer
>>       element indexed will contain all the inner element content as well)
>>    - Create the hierarchy of the fields, like "title", "paragraph/title",
>>    "paragraph/title/subparagraph/table". Possible impacts:
>>       - count of fields remains the same
>>       - soft set of fields (not consistent)
>>       - I'm not sure about the ways I could process required information
>>       and perform search.
>>       - Performance issues?
>>    - Use one field for content and just add location prefix to content.
>>    For example "contents:*paragraph/heading:*token1 token2".
>>    *paragraph/heading:* here is used as additional information prefix. So,
>>    I (possibly?) could reuse PrefixQuery functionality or smth. Impacts:
>>       - Strong set of index fields (small)
>>       - Additional information processing - all the queries I'll use will
>>       have to work as PrefixQuery
>>       - Performance issues?
>>
>>
>> So, have anyone tried to make things work like that? Or am I trying to use
>> wrench to hammer in nails? I assume Lucene wasn't thought to be used like
>> that, but it's worth trying (at least asking).
>> Any results / suggestions are welcome!
>>
>> --
>> Bests regards,
>> Leonid Maslov!
>> Adrienne Gusoff - "Opportunity knocked. My doorman threw him out."
>>
>
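
For illustration, here is a minimal sketch of the one-document-per-section idea from the reply, written in plain Python rather than against the Lucene API. The "text" and "sectionType" fields come from the reply; the `source` field used to group sections by file, the `search` function and the count-based ranking are assumptions made for this example and are not taken from XTF or Lucene.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionDoc:
    source: str        # hypothetical grouping key, e.g. the HTML file name
    section_type: str  # the "sectionType" field from the reply
    text: str          # the "text" field from the reply


def search(index, word, section_type=None):
    """Return one representative hit per source file, best first."""
    word = word.lower()
    hits_per_source = defaultdict(list)
    for doc in index:
        # Optionally restrict the search to one kind of section.
        if section_type is not None and doc.section_type != section_type:
            continue
        if word in doc.text.lower().split():
            hits_per_source[doc.source].append(doc)
    # Collapse the section-level hits into one entry per source file,
    # ranked by the number of matching sections (a stand-in for a real score).
    return sorted(
        ((source, len(docs)) for source, docs in hits_per_source.items()),
        key=lambda item: (-item[1], item[0]),
    )


index = [
    SectionDoc("a.html", "title", "Lucene basics"),
    SectionDoc("a.html", "table", "lucene version table"),
    SectionDoc("b.html", "paragraph", "Using Lucene with XML"),
    SectionDoc("b.html", "table", "release dates"),
]

# "Documents with word xy in title"
title_hits = search(index, "lucene", section_type="title")
# Same word in any section, still one hit per file
any_hits = search(index, "lucene")
```

Each section is stored once, so there is no duplication of inner content in outer fields, and the set of fields stays fixed at two plus the grouping key. The cost is the extra grouping step at query time.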