Isn't the trouble with a threshold based on raw scores that the Similarity scoring mechanism considers each document in isolation? At that stage we don't know whether the query is a good one in general (i.e. spelt correctly, and not a Googlewhack combination of rarely co-located terms). For example, we don't know whether the coord factor was very poor for all docs, in which case the score threshold applied to each doc should be relaxed.

A simple solution may be to delay thresholding until all results are in, treat the top result as the "best you can get" for the given query (i.e. "100%"), and set the threshold for accepting other results at something like 70% of the top score.
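
As a rough sketch of that idea (not Lucene API; the function name, the 0.7 default and the sample scores are all made up for illustration), assuming the raw scores for every hit have already been collected:

```python
def filter_by_top_fraction(scores, fraction=0.7):
    """Keep only scores that are at least `fraction` of the best score.

    `scores` is a plain list of raw scores, one per hit. Hypothetical
    helper for illustration only, not part of any library.
    """
    if not scores:
        return []
    top = max(scores)          # the "best you can get" for this query
    cutoff = fraction * top    # e.g. 70% of the top score
    return [s for s in scores if s >= cutoff]


# Made-up raw scores for one query
raw_scores = [2.4, 1.9, 1.7, 0.8, 0.3]
print(filter_by_top_fraction(raw_scores))
```

The cutoff can only be computed once the top score is known, which is why thresholding has to wait until all results are in.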

This too has its faults. I've found it useful to consider examples of different queries and the distribution of their (normalized) scores:

* GoogleWhack query (rare or misspelt terms: high idf, low coord; only one result with ALL terms)
[octupus jacuzzi tango]
1, 0.30, 0.30, 0.25, 0.25

* Very rare query (rare or misspelt terms: high idf, very low coord; NO result with ALL terms)
[octupus jacuzzi unicycle]
1, 0.90, 0.88, 0.88, 0.88

* Good query (some rarer terms, maybe some common ones, but several docs contain more than one of the rarer terms)
[installing a jacuzzi in the home]
1, 0.90, 0.80, 0.78, 0.30, 0.20

* Too-common query (many common terms: results have high coord but low idf)
[home page of the web site]
1, 0.99, 0.99, 0.98, 0.93
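
To see how a fixed cutoff behaves on these four distributions, here is a small sketch that counts how many hits a 70% rule would keep in each case. It assumes the "Good query" figures are fractions (0.90, 0.80, ...) like the others; `kept_count` is just an illustrative helper.

```python
# Normalized score distributions from the four example queries above
examples = {
    "GoogleWhack": [1.0, 0.30, 0.30, 0.25, 0.25],
    "Very rare":   [1.0, 0.90, 0.88, 0.88, 0.88],
    "Good":        [1.0, 0.90, 0.80, 0.78, 0.30, 0.20],  # assumed fractions
    "Too common":  [1.0, 0.99, 0.99, 0.98, 0.93],
}


def kept_count(scores, fraction=0.7):
    """Number of hits scoring at least `fraction` of the top score."""
    cutoff = fraction * max(scores)
    return sum(1 for s in scores if s >= cutoff)


for name, scores in examples.items():
    print(f"{name}: keeps {kept_count(scores)} of {len(scores)}")
```

The very rare and too-common queries both keep every listed hit, even though their raw scores are high or low for opposite reasons, so the normalized shape alone does not tell the two apart.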

Looking at these normalized scores I suspect this "70% of top" rule doesn't work well in all cases. Maybe a better solution lies in mixing the "% of top" rule with the raw-scores thresholds somehow.
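
One possible way of mixing the two, purely as a sketch: require a hit to pass both the "% of top" test and an absolute raw-score floor. The function name, the 0.5 floor and the sample scores are invented for illustration; a sensible floor would depend on the Similarity in use and on the kind of queries expected.

```python
def hybrid_filter(raw_scores, fraction=0.7, raw_floor=0.5):
    """Keep hits that pass both a relative and an absolute test.

    A hit survives only if its raw score is at least `fraction` of the
    top raw score AND at least `raw_floor`. Both parameters are
    hypothetical tuning knobs, not values from any library.
    """
    if not raw_scores:
        return []
    cutoff = max(fraction * max(raw_scores), raw_floor)
    return [s for s in raw_scores if s >= cutoff]


# Flat but uniformly weak raw scores: the relative rule alone would keep
# all of them, the raw floor rejects them.
print(hybrid_filter([0.20, 0.18, 0.176, 0.176]))

# Reasonable raw scores: the relative rule does the trimming.
print(hybrid_filter([2.4, 1.9, 1.7, 0.8, 0.3]))
```

Whether a flat list of weak hits should be dropped like this, or accepted because the query was simply a hard one, is exactly the policy question raised above; the floor could just as well be lowered when the top raw score is poor.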


                