Isn't the trouble with introducing a scoring threshold based on raw
scores that the Similarity scoring mechanism considers each
document in isolation? At this stage we don't know whether the query is
generally a good one or not (i.e. spelt correctly, and not a Googlewhack
combination of rarely colocated terms). For example, we don't know if,
in general, the coord factor was very poor for all docs, in which case
the score threshold applied to each doc should be relaxed.
A simple solution may be to delay thresholding until all results are in,
treat the top result as the "best you can get" for the given
query (i.e. "100%"), and set the threshold for accepting other results at
something like 70% of the top score.
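
As a rough illustration of that rule, here is a minimal sketch in Python.
It is not Lucene code: the function name keep_top_fraction and the plain
list of scores are made up for the example, and the 0.7 default is just
the "something like 70%" figure from above.

```python
def keep_top_fraction(scores, fraction=0.7):
    """Keep scores that reach at least `fraction` of the best score.

    `scores` is a plain list of raw (un-normalized) scores for one query,
    collected after all results are in. Hypothetical helper, not a Lucene API.
    """
    if not scores:
        return []
    # The top result stands in for "100%" for this particular query.
    cutoff = fraction * max(scores)
    return [s for s in scores if s >= cutoff]
```

Because the cutoff is relative to the top hit, it works the same on raw
scores and on scores already normalized so that the top hit is 1.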
This too has its faults: I've found it useful to consider examples of
different queries and the distribution of their (normalized) scores.
* Googlewhack query (rare or misspelt terms - high idf, low coord - only
one result with ALL terms)
[octupus jacuzzi tango]
1, 0.30, 0.30, 0.25, 0.25
* Very rare query (rare or misspelt terms - high idf, very low coord - NO
result with ALL terms)
[octupus jacuzzi unicycle]
1, 0.90, 0.88, 0.88, 0.88
* Good query (some rarer terms maybe some common - but several docs
contain > 1 of the rarer terms)
[installing a jacuzzi in the home]
1, 0.90, 0.80, 0.78, 0.30, 0.20
* Too-common query (many common terms - results have high coord but low idf):
[home page of the web site]
1, 0.99, 0.99, 0.98, 0.93
Looking at these normalized scores I suspect this "70% of top" rule
doesn't work well in all cases. Maybe a better solution lies in mixing
the "% of top" rule with the raw-scores thresholds somehow.
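
To make that concrete, here is a small Python sketch that applies a
70%-of-top cutoff to the four example distributions and counts how many
results survive. The dictionary keys and the helper name are made up for
the example, and the "good query" row is read as 0.90, 0.80, 0.78, 0.30,
0.20, which is an assumption about what was meant.

```python
def count_kept(scores, fraction=0.7):
    """Count scores that reach at least `fraction` of the best score."""
    if not scores:
        return 0
    cutoff = fraction * max(scores)
    return sum(1 for s in scores if s >= cutoff)


# Normalized scores from the four example queries (top hit = 1).
examples = {
    "googlewhack": [1, 0.30, 0.30, 0.25, 0.25],
    "very_rare": [1, 0.90, 0.88, 0.88, 0.88],
    # Assumed reading of the "good query" row as fractions of the top score.
    "good": [1, 0.90, 0.80, 0.78, 0.30, 0.20],
    "too_common": [1, 0.99, 0.99, 0.98, 0.93],
}

kept = {name: count_kept(scores) for name, scores in examples.items()}
```

With these numbers the cutoff keeps 1 of 5 Googlewhack results and 4 of 6
for the good query, but all 5 listed results for both the very rare and
the too-common query, so in those two cases it filters nothing.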