I am trying to make some high-level (and some not-so-high-level) design 
decisions for an app that checks a collection of documents against a set of 
terms/queries. Basically, I need to perform a triage of sorts: find only those 
docs in the collection that contain at least one term from the term list. For 
those docs, I also need to find where in the document each occurrence is, 
since I then need to collect a small amount of surrounding text for a more 
detailed analysis.
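
To make the requirement concrete, here is a rough plain-Java sketch, without 
Lucene, of the kind of output I am after. The class and method names 
(TermTriage, findOccurrences) and the character-based context window are made 
up for illustration, and it does naive case-insensitive substring matching, 
which is certainly not what Lucene's analyzers or highlighter would do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TermTriage {

    /** One occurrence of a term, plus a window of surrounding text. */
    public static final class Occurrence {
        public final String term;    // the term that matched
        public final int start;      // offset of the match in the document text
        public final int end;        // offset just past the match
        public final String context; // match plus surrounding characters

        Occurrence(String term, int start, int end, String context) {
            this.term = term;
            this.start = start;
            this.end = end;
            this.context = context;
        }
    }

    /**
     * Finds every case-insensitive occurrence of each term in the text and
     * keeps up to `window` characters of context on either side.
     * Assumption: lower-casing does not change string length (true for
     * ASCII text), so offsets in the lower-cased copy match the original.
     * An empty result means the document fails the triage.
     */
    public static List<Occurrence> findOccurrences(String text, List<String> terms, int window) {
        List<Occurrence> found = new ArrayList<>();
        String haystack = text.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            String needle = term.toLowerCase(Locale.ROOT);
            if (needle.isEmpty()) {
                continue; // skip empty terms; they would match everywhere
            }
            int from = 0;
            int at;
            while ((at = haystack.indexOf(needle, from)) >= 0) {
                int end = at + needle.length();
                int ctxStart = Math.max(0, at - window);
                int ctxEnd = Math.min(text.length(), end + window);
                found.add(new Occurrence(term, at, end, text.substring(ctxStart, ctxEnd)));
                from = end; // continue after this match
            }
        }
        return found;
    }
}
```

A doc would pass the triage when findOccurrences returns a non-empty list, and 
the context strings are what I would hand to the more detailed analysis.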

Clearly, I will need to index the document collection using Lucene's indexing 
classes. This is pretty straightforward. 

Then I will need to use the highlighting classes. In some sample code I found 
online, a query is first searched for and hits are returned. Then doc IDs are 
extracted for the hits and the query is highlighted. Some questions:

Q1: Does Lucene perform essentially the same searching operation twice, first 
to find hits, then to highlight? If so, does this mean that if I expect most of 
the docs in my collection to contain at least one of the search terms, it might 
be faster for me to skip searching and simply go over all docs, applying 
highlighting? Then for those docs where no hits occurred I would simply get an 
empty list of relevant fragments. 

Q2: Is the same scoring mechanism used during search and during highlighting? 
That is, can I be sure that if I get a hit during search, the corresponding 
document indeed contains my query, which will then be found during highlighting?

Q3: Are there any mechanisms in Lucene that would facilitate merging of 
highlighting results for two different queries against a single document? 

Q4: I did some small tests of highlighting and noticed that some of the 
fragments returned for a query contained highlighted text that was quite far 
from the original query. For instance, I was looking for a 3-word term and it 
highlighted a sequence of only 2 of these 3 words. How can I control how close 
highlighted fragments should be to the original query?



Thanks much,

Ilya Zavorin


