On Mar 29, 2006, at 5:09 AM, Marios Skounakis wrote:
I am executing searches that return between 2000 and 10000
documents and sorting the results by relevance (or sometimes
alphabetically).
In every query, I need to discard some of the results based on
their docId. I have a list of the docIds that need to be discarded
in an array. The size of the list is usually about 100.
I generally perform paging, so I don't need to display all the
results, but only those of the current page (e.g. results 100 to 120)
Currently, after getting the Hits object from the Searcher, I loop
over the documents, retrieve their docId, and see if it is in the
discard list. If not, I put the document in a validResult
collection. When I have read enough valid results to be able to
show the current page, I stop. E.g., in the above example, I would
read until I had 120 valid results.
The problem with this implementation is that I have to read the
docId for the results that precede the current page, in order to
determine if they are in the invalid list. So, when showing the
pages near the last page, I have to read the entire result list. I
don't like this because this means that the computation required to
display a page is not constant but depends on the page's position.
Anyway, what do the experts think? Is this implementation
prohibitively expensive?
Yes, that implementation is going to be prohibitively expensive. The
recommended way to employ a search constraint as you described is to
use a Filter. You could write a custom Filter that would use your
discarded docId array, and turn off those documents from the BitSet,
and turn on all the rest. Then you will use the search(Query,
Filter) method signature for searching.
(As a sidenote, when calling hits.doc(i), does Lucene retrieve the
whole document, or just a pointer to it, and retrieves the data
when doing hits.doc(i).getField...?)
It retrieves the whole document currently.
An alternative would be to extend the query to exclude the ids in
the discard list. How would adding 100 exclusion clauses to the
query impact the query's performance? Are there any studies on
search speed in relation to the number of query clauses?
Adding them to the query would certainly work but a Filter is a
cleaner and faster way for your scenario.
Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]