[Nutch-dev] Re: IndexSorter optimizer

Doug Cutting Mon, 02 Jan 2006 14:44:05 -0800

Andrzej Bialecki wrote:

Using the original index, it was possible for pages with high tf/idf of a term, but with a low "boost" value (the OPIC score), to outrank pages with high "boost" but lower tf/idf of a term. This phenomenon leads quite often to results that are perceived as "junk", e.g. pages with a lot of repeated terms, but with little other real content, like for example navigation bars.

Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field==content?
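
To make the difference concrete, here is a rough standalone sketch comparing the two dampening curves. It does not touch NutchSimilarity or Lucene at all; the class name TfDampening, the method names, and the choice of ln(1 + freq) as the log variant are assumptions made for illustration only, not a proposed patch.

```java
public class TfDampening {

    // Square-root dampening, i.e. the sqrt() behaviour discussed above.
    static float sqrtTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Hypothetical log variant: ln(1 + freq). The "+ 1" is an assumption
    // so that freq == 0 maps to 0 instead of negative infinity.
    static float logTf(float freq) {
        return (float) Math.log(1.0 + freq);
    }

    public static void main(String[] args) {
        // Term frequencies ranging from a normal page to a page that
        // repeats the same term hundreds of times (e.g. navigation bars).
        float[] freqs = {1f, 4f, 25f, 100f, 400f};
        for (float f : freqs) {
            System.out.printf("freq=%5.0f  sqrt=%6.2f  log=%6.2f%n",
                    f, sqrtTf(f), logTf(f));
        }
    }
}
```

With sqrt(), a page repeating a term 400 times still gets a tf factor of 20, while the log variant stays in the single digits, so heavy repetition would count for much less relative to the boost.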

To conclude, I will add the IndexSorter.java to the core classes, and I suggest to continue the experiments ...

I've updated the version of Lucene included with Nutch to have the required patch. Would you like me to commit IndexSorter.java or would you?


Doug

