This subject brings up an interesting idea.

I question the value of any search that returns 100k-200k hits. What is the point?

The question then becomes: when is such a high-frequency term relevant? It seems that it is only relevant when combined with other terms.

For example, I search for "hurricane katrina" and I get 100k-200k hits. Anything beyond the top 1000 or so is probably irrelevant.

But you still need to score all of those hits in order to find the top ones.
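To make the cost concrete, here is a minimal Python sketch (not Lucene code). The function name top_k, the toy scoring function, and the hit count are all made up for illustration. It only shows that picking the best few hits still means scoring every one of them:

```python
import heapq

def top_k(hits, score, k):
    """Return the k best hits and how many hits had to be scored."""
    scored = 0

    def counted(doc_id):
        nonlocal scored
        scored += 1              # count every call to the scoring function
        return score(doc_id)

    best = heapq.nlargest(k, hits, key=counted)
    return best, scored

# Hypothetical data: 200,000 matching doc ids and an arbitrary toy score.
best, scored = top_k(range(200_000), lambda d: (d * 7919) % 100_003, 3)
print(len(best), "hits kept,", scored, "hits scored")
```

The heap keeps memory small, but the scoring work still grows with the full hit count, which is the part I would like to avoid.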

But if I search for both "hurricane katrina" and "president bush", maybe I only get 1000 documents, and possibly a far different set than the top 1000 when searching on "hurricane katrina" alone.

It seems that an efficient fix for this would be to add a "relevancy bit" to each document in the posting list for a term. It is basically a single-bit norm per document and term.
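A rough sketch of the data structure, in Python rather than Lucene's Java. Everything here is hypothetical: the Posting class, build_postings, and especially the rule for setting the bit. I have not said how the bit would be chosen, so the sketch simply assumes it is set at index time for the keep_top documents with the highest term frequency:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Posting:
    doc_id: int
    freq: int        # term frequency in this document
    relevant: bool   # the proposed per-document, per-term "relevancy bit"

def build_postings(term_freqs, keep_top):
    """Build the posting list for ONE term from {doc_id: freq}.

    Assumption: the bit is set for the keep_top documents with the
    highest term frequency. Any other per-term measure could be used.
    """
    ranked = sorted(term_freqs, key=lambda d: (-term_freqs[d], d))
    relevant_ids = set(ranked[:keep_top])
    # Posting lists stay ordered by doc id, as usual.
    return [Posting(d, term_freqs[d], d in relevant_ids)
            for d in sorted(term_freqs)]

postings = build_postings({1: 5, 2: 1, 3: 9, 4: 2}, keep_top=2)
print([p.doc_id for p in postings if p.relevant])
```

The extra storage is one bit per posting entry, so it should be cheap compared to the frequencies and positions already stored.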

When a query is run, skipTo() ignores any document whose relevancy bit is not set for that document/term, and sets a flag recording that documents were skipped.

If the query completes without finding the requested number of documents, and documents were skipped, the query is rerun without the skipping. Alternatively, if during scoring it looks like the requested number of documents is not going to be reached, the query can disable the skipping at that point, and re-enable it once the number is reached. To make this work, the least frequent terms should be scored first. The 'relevancy bit' would also only be checked for high-frequency terms.
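Here is a minimal Python sketch of the skip-then-rerun idea for an AND query. It is not Lucene code, and the names search and high_freq_threshold and the tuple layout of the postings are made up for illustration. Scoring is left out entirely, and only the simple "rerun the whole query" fallback is shown, not the mid-query toggling:

```python
def search(postings_by_term, terms, wanted, high_freq_threshold):
    """AND query over {term: [(doc_id, relevancy_bit), ...]}.

    Returns (doc_ids, reran), where reran says whether the fallback
    pass without skipping was needed.
    """
    def run(use_bit):
        skipped = False
        candidates = None
        # Process the least frequent terms first.
        for term in sorted(terms, key=lambda t: len(postings_by_term[t])):
            plist = postings_by_term[term]
            # Only high-frequency terms consult the relevancy bit.
            check = use_bit and len(plist) >= high_freq_threshold
            docs = set()
            for doc_id, relevant in plist:
                if check and not relevant:
                    skipped = True   # remember that something was skipped
                    continue
                docs.add(doc_id)
            candidates = docs if candidates is None else candidates & docs
        return sorted(candidates or []), skipped

    hits, skipped = run(use_bit=True)
    if len(hits) < wanted and skipped:
        hits, _ = run(use_bit=False)  # rerun without the skipping
        return hits[:wanted], True
    return hits[:wanted], False

# Hypothetical index: "katrina" is the high-frequency term.
index = {
    "katrina": [(1, True), (2, False), (3, True), (4, False), (5, False)],
    "bush": [(2, True), (3, True)],
}
print(search(index, ["katrina", "bush"], wanted=1, high_freq_threshold=4))
print(search(index, ["katrina", "bush"], wanted=2, high_freq_threshold=4))
```

With wanted=1 the first pass is enough. With wanted=2 the skipped document 2 is needed, so the query falls back to the full pass. The worst case costs two passes, which is why the mid-query toggle might be worth the extra complexity.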

Has anyone implemented something like this? Thoughts on this?

