Re: Reducing number of poor results from large BooleanQueries

markharw00d Fri, 09 Sep 2005 17:45:19 -0700

Isn't the trouble with introducing a scoring threshold based on raw scores that the Similarity scoring mechanism is considering each document in isolation? At this stage we don't know if the query is generally a good one or not (i.e. spelt correctly, and not a Googlewhack combination of rarely colocated terms). For example, we don't know if, in general, the coord factor was very poor for all docs and so the score threshold used by each doc should be relaxed as a consequence.

A simple solution may be to delay thresholding until all results are in, to consider the top result as the "best you can get" for the given query, i.e. "100%", and to set the threshold for accepting other results at something like 70% of the top score.
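As a rough illustration of that idea (a sketch only, not Lucene code; the function name `keep_relative` and its `fraction` parameter are made up for this example), the rule amounts to collecting all scores first and then cutting relative to the best one:

```python
def keep_relative(scores, fraction=0.7):
    """Keep only the scores that reach `fraction` of the top score.

    `scores` is the full list of hit scores for one query, collected
    before any thresholding is applied. Hypothetical helper, not part
    of any Lucene API.
    """
    if not scores:
        return []
    top = max(scores)           # the "best you can get" for this query
    cutoff = fraction * top     # e.g. 70% of the top score
    return [s for s in scores if s >= cutoff]


# Raw (un-normalized) scores work just as well, since the cut is relative.
print(keep_relative([2.0, 1.5, 1.0, 0.5]))
```

Because the cutoff is derived from the top hit, the same code behaves identically on raw and normalized scores.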

This too has its faults: I've found it useful to consider examples of different queries and the distribution of their (normalized) scores.

* GoogleWhack query (rare or misspelt terms - hi idf, low coord - only one result with ALL terms)

[octupus jacuzzi tango]
1, 0.30, 0.30, 0.25, 0.25

* Very rare query (rare or misspelt terms - hi idf, very low coord - NO result with ALL terms)

[octupus jacuzzi unicycle]
1, 0.90, 0.88, 0.88, 0.88

* Good query (some rarer terms, maybe some common - but several docs contain > 1 of the rarer terms)

[installing a jacuzzi in the home]
1, 0.90, 0.80, 0.78, 0.30, 0.20

* Too-common query (many common terms - results have hi coord but low idf):
[home page of the web site]
1, 0.99, 0.99, 0.98, 0.93

Looking at these normalized scores I suspect this "70% of top" rule doesn't work well in all cases. Maybe a better solution lies in mixing the "% of top" rule with the raw-scores thresholds somehow.
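To make the suspicion concrete, here is a small sketch that applies a 70% cut to the four score lists above. The helper name and the dictionary labels are made up for the example, and the "good query" list is read as 0.90, 0.80, 0.78, 0.30, 0.20 so that it sits on the same 0-1 scale as the others (that reading is an assumption).

```python
def count_kept(scores, fraction=0.7):
    """Count the scores that reach `fraction` of the top score (hypothetical helper)."""
    cutoff = fraction * max(scores)
    return sum(1 for s in scores if s >= cutoff)


# Normalized score distributions from the four example queries.
examples = {
    "googlewhack": [1, 0.30, 0.30, 0.25, 0.25],
    "very rare":   [1, 0.90, 0.88, 0.88, 0.88],
    "good":        [1, 0.90, 0.80, 0.78, 0.30, 0.20],  # assumed 0-1 scale
    "too common":  [1, 0.99, 0.99, 0.98, 0.93],
}

kept = {name: count_kept(scores) for name, scores in examples.items()}

for name, scores in examples.items():
    print(f"{name:12s} keeps {kept[name]} of {len(scores)}")
# googlewhack  keeps 1 of 5
# very rare    keeps 5 of 5
# good         keeps 4 of 6
# too common   keeps 5 of 5
```

The cut looks sensible for the GoogleWhack and good queries, but it passes every result for the very rare query (where no doc has all the terms) and for the too-common query, which is where some raw-score information would presumably have to come in.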
