Hi everyone! This is my first post here and I'm new to Lucene, so I would
appreciate your feedback on the Lucene document design I came up with.
*What is my goal*
I'm trying to index a collection of XML documents that all share the same
structure. Each <section> tag can itself contain a <sections> tag, which in
turn contains <section> tags, and so on. The maximum depth is 3. The
structure looks like this:
<doc>
  <title>
  </title>
  <sections>
    <section>
      <title>
      </title>
      <text>
      </text>
    </section>
  </sections>
</doc>
So I decided to use these separate fields:
"pageTitle" - doc/title
"sectionTitle" - doc/sections/section/title
"sectionText" - doc/sections/section/text
"subSectionTitle" - doc/sections/section/sections/section/title
"subSectionText" - doc/sections/section/sections/section/text
"subSubSectionTitle" - ...
"subSubSectionText" - ...
Currently, when I index, each Lucene document is a separate section text,
section title, or one of the sub-level items, but they all carry the same
pageTitle field, of course. Is that a good way to index the documents for
searching? I describe below *how I'm going to search*.
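The flattening described above can be sketched without any Lucene dependency. This is only an illustration under assumptions: the class and method names (SectionFlattener, flatten, collect) are hypothetical, the sample values are made up, and unlike the description above it keeps the title and text of one section together in the same record. Each resulting map would then correspond to one Lucene document.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: turns one <doc> page into one field map per <section>.
class SectionFlattener {
    // Field-name prefixes per nesting depth, matching the field list in the post.
    static final String[] PREFIX = {"section", "subSection", "subSubSection"};

    /** Parses one page and returns one field map per <section>, at any depth up to 3. */
    static List<Map<String, String>> flatten(String xml) {
        try {
            Element doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
            List<Map<String, String>> out = new ArrayList<>();
            collect(child(doc, "sections"), childText(doc, "title"), 0, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse page XML", e);
        }
    }

    // Walks a <sections> element; every record repeats the page title.
    static void collect(Element sections, String pageTitle, int depth,
                        List<Map<String, String>> out) {
        if (sections == null || depth >= PREFIX.length) return;
        for (Node n = sections.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (!(n instanceof Element) || !"section".equals(n.getNodeName())) continue;
            Element section = (Element) n;
            Map<String, String> fields = new LinkedHashMap<>();
            fields.put("pageTitle", pageTitle);
            fields.put(PREFIX[depth] + "Title", childText(section, "title"));
            fields.put(PREFIX[depth] + "Text", childText(section, "text"));
            out.add(fields);
            // Nested <sections> become their own records, one level deeper.
            collect(child(section, "sections"), pageTitle, depth + 1, out);
        }
    }

    // First direct child element with the given tag name, or null.
    static Element child(Element parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && name.equals(n.getNodeName())) return (Element) n;
        }
        return null;
    }

    // Trimmed text of a direct child element, or "" if it is absent.
    static String childText(Element parent, String name) {
        Element e = child(parent, name);
        return e == null ? "" : e.getTextContent().trim();
    }

    public static void main(String[] args) {
        // Made-up sample page; "x" stands in for a disease name.
        String xml = "<doc><title>x</title><sections>"
                + "<section><title>Treatment</title><text>top-level text</text>"
                + "<sections><section><title>Dosage</title><text>nested text</text>"
                + "</section></sections></section></sections></doc>";
        for (Map<String, String> fields : flatten(xml)) {
            System.out.println(fields);
        }
    }
}
```

Keeping a section's title and text in one record means a match on the title can return the text directly. Whether that suits the real data is a judgment call.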
The real page/document structure is like this: pageTitle is the disease
name, and a sectionTitle can be e.g. "Definition" or "Treatment" or
something like that. So, when the user asks a question like "What are the
treatments for "x" disease?", I classify the question as a "treatment"
type. I would then like to search for the disease name in the Lucene index,
but specifically retrieve the section whose title is "treatment".
Is that a good indexing approach? Also, how would you recommend I construct
the search query, given that I want to give the disease name more
importance and the type ("treatment") relatively less?
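One possible way to express that weighting is a query string in Lucene's classic query parser syntax, where `+` marks a required clause and `^` applies a boost. The sketch below only builds the string with plain Java; the class name BoostedQueryBuilder, its methods, and the boost values 3.0 and 1.0 are assumptions for illustration. It also assumes the analyzer used at index time lowercases titles, so that "treatment" matches "Treatment".

```java
import java.util.Locale;

// Hypothetical helper: builds a classic-query-parser string with per-field boosts.
class BoostedQueryBuilder {
    // One quoted phrase clause of the form field:"text"^boost.
    static String phrase(String field, String text, float boost) {
        // Escape backslashes and quotes so the phrase stays intact.
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return String.format(Locale.ROOT, "%s:\"%s\"^%.1f", field, escaped, boost);
    }

    // Disease name is required; the question type is optional and only affects ranking.
    static String build(String disease, String type, float diseaseBoost, float typeBoost) {
        return "+" + phrase("pageTitle", disease, diseaseBoost)
                + " " + phrase("sectionTitle", type, typeBoost)
                + " " + phrase("subSectionTitle", type, typeBoost)
                + " " + phrase("subSubSectionTitle", type, typeBoost);
    }

    public static void main(String[] args) {
        System.out.println(build("x", "treatment", 3.0f, 1.0f));
    }
}
```

With these inputs the string is `+pageTitle:"x"^3.0 sectionTitle:"treatment"^1.0 subSectionTitle:"treatment"^1.0 subSubSectionTitle:"treatment"^1.0`. Because the title clauses are optional, sections of other types for the same disease still match but should rank lower. If only "treatment" sections should come back at all, the type would have to be a required clause instead.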
Thanks in advance!
--
View this message in context:
http://lucene.472066.n3.nabble.com/Help-with-document-design-for-indexing-searching-tp4075228.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.