Re: Proposal: extracting term-level stats from query process

Doug Cutting Thu, 11 Mar 2004 09:56:41 -0800

[EMAIL PROTECTED] wrote:

I think the TermScorer could be used to produce some useful feedback on performance of terms used in queries with the addition of some new methods:
int getNumDocMatches();

Is this just IndexReader#docFreq(Term), or is it the sum of all of the TermDocs#freq() values for the term?

float getAverageScore();

Would the average really be that useful? It could be the same for a term which has ten very strong matches and ninety very weak matches as for a term that has 100 middling matches.
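
To make the objection concrete, here is a small worked sketch. The score values (0.9, 0.1, 0.18) are invented purely for illustration and are not real Lucene scores; the class and method names are hypothetical.

```java
class AverageScoreDemo {
    // Arithmetic mean of a set of per-document scores.
    static double mean(double[] scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        return scores.length == 0 ? 0.0 : sum / scores.length;
    }

    // Term A: ten very strong matches and ninety very weak ones (made-up values).
    static double[] skewedScores() {
        double[] scores = new double[100];
        for (int i = 0; i < 100; i++) {
            scores[i] = (i < 10) ? 0.9 : 0.1;
        }
        return scores;
    }

    // Term B: one hundred middling matches (made-up value).
    static double[] middlingScores() {
        double[] scores = new double[100];
        java.util.Arrays.fill(scores, 0.18);
        return scores;
    }

    public static void main(String[] args) {
        // Both means come out at about 0.18 despite very different distributions.
        System.out.println(mean(skewedScores()));
        System.out.println(mean(middlingScores()));
    }
}
```

Both terms report the same average even though only the first has any strong matches, so the average alone cannot tell the two cases apart.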

These could be used in the following scenarios:
* selecting which terms to offer spelling correction on (when numDocMatches==0)

Would the above be better than IndexReader#docFreq(Term) for this?

* influencing the highlighter selections (doc fragments scored based on contained term weights)

I don't see how the above would help here. The ideal way to score fragments would be to create an index (e.g., using a RAMDirectory) of fragments, then search this with the query to find the top matches. One can approximate this more efficiently by looking for fragments with a high density of query terms, perhaps taking idf's into account.
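
A minimal sketch of the density approximation, assuming the document frequencies of the query terms have already been read from the index (e.g., via IndexReader#docFreq(Term)). The class, the tokenization, and the length normalization are illustrative assumptions, not part of Lucene; the idf formula follows the shape used by Lucene's DefaultSimilarity, log(numDocs / (docFreq + 1)) + 1.

```java
import java.util.Map;

class FragmentDensity {
    // idf in the shape used by Lucene's DefaultSimilarity.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Score = sum of idf over query-term occurrences, divided by the
    // fragment length in tokens, so dense fragments score higher.
    // queryTermDocFreqs maps each lowercased query term to its docFreq.
    static double score(String fragment, Map<String, Integer> queryTermDocFreqs, int numDocs) {
        String[] tokens = fragment.toLowerCase().split("\\W+");
        double sum = 0.0;
        int count = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield a leading empty token
            }
            count++;
            Integer docFreq = queryTermDocFreqs.get(token);
            if (docFreq != null) {
                sum += idf(docFreq, numDocs);
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }
}
```

A highlighter could compute this score for each candidate fragment and keep the top few. Only the term dictionary is consulted for the docFreq values; the fragment text itself comes from the stored document.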

* For "more like this" natural language type queries the highlighter could highlight only "significantly" scored terms and ignore low-scoring noise words.

The best method to identify significant words is with Similarity#idf(Term,Searcher). Significant words have higher idfs, noise words have lower idfs.
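
As a rough sketch of this, the following filters a set of terms by idf. It assumes the docFreq values and document count have already been obtained from the index; the class name, the method name, and the minIdf cutoff are hypothetical, and the idf formula is the one in the shape of Lucene's DefaultSimilarity rather than a call into Lucene itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SignificantTerms {
    // idf in the shape used by Lucene's DefaultSimilarity.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Keep only terms whose idf reaches minIdf (an arbitrary, tunable cutoff).
    static List<String> significant(Map<String, Integer> docFreqs, int numDocs, double minIdf) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            if (idf(entry.getValue(), numDocs) >= minIdf) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

With 1000 documents, a word that occurs in 950 of them has an idf just above 1, while a word that occurs in 12 has an idf above 5, so a cutoff between those values separates noise words from significant ones.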

I know it would be possible to derive all this information using existing APIs but it would effectively involve another pass of the same index data.

Unless I am mistaken, I think most of what you're after can be accomplished with only another access to the term dictionary data, and does not require another pass over, e.g., the TermDocs.

Doug

