Newbie question: using Lucene to index hierarchical information.

Leonid Maslov Mon, 01 Sep 2008 00:26:18 -0700

Hi all,

First of all, sorry for my poor English. It's not my native language.


I'm trying to use Lucene to index hierarchical kind of information: I have
structured html and pdf/word documents and I want to index them in ways to
perform search in titles, text, paragraphs or tables only, or any
combinations of items mentioned above. At the moment I see 3 possible
solutions:

   - Create the set of all possible fields, like: contents, title, heading,
   table etc... And index the data in all them accordingly. Possible impacts:
   - a big count of fields
      - data duplication (because I need to make search looking in the
      paragraphs to look inside all the inner elements, so every outer element
      indexed will contain all the inner element content as well)
   - Create the hierarchy of the fields, like "title", "paragraph/title",
   "paragraph/title/subparagraph/table". Possible impacts:
      - count of fields remains the same
      - soft set of fields (not consistent)
      - I'm not sure about the ways I could process required information and
      perform search.
      - Performance issues?
      - Use one field for content and just add location prefix to content.
   For example "contents:*paragraph/heading:*token1 token2". *
   paragraph/heading:* here is used as additional information prefix. So, I
   (possibly?) could reuse PrefixQuery functionality or smth. Impacts:
      - Strong set of index fields (small)
      - Additional information processing - all the queries I'll use will
      have to work as PrefixQuery
      - Performance issues?


So, have anyone tried to make things work like that? Or am I trying to use
wrench to hammer in nails? I assume Lucene wasn't thought to be used like
that, but it's worth trying (at least asking).
Any results / suggestions are welcome!

-- 
Bests regards,
Leonid Maslov!
Adrienne Gusoff  - "Opportunity knocked. My doorman threw him out."

Newbie question: using Lucene to index hierarchical information.

Reply via email to