Andrzej Bialecki wrote:
Using the original index, it was possible for pages with a high tf/idf for a term but a low "boost" value (the OPIC score) to outrank pages with a high "boost" but a lower tf/idf for that term. This quite often leads to results that are perceived as "junk", e.g. pages with a lot of repeated terms but little other real content, such as navigation bars.
Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field==content?
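To make the difference concrete, here is a rough sketch comparing the two damping functions. It is not the actual NutchSimilarity code: the class name, the field-aware `tf(String, float)` signature, and the `1 + ln(freq)` form of the logarithm are all assumptions made for illustration. Only the sqrt() variant reflects what Lucene's default tf() does.

```java
// Hypothetical sketch, not part of Nutch or Lucene.
final class TfDampingSketch {

    // Square-root damping, as in Lucene's default tf().
    static float sqrtTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Logarithmic damping. The "1 +" offset is an assumption so that a
    // single occurrence still scores 1; zero occurrences score 0.
    static float logTf(float freq) {
        return freq > 0 ? 1.0f + (float) Math.log(freq) : 0.0f;
    }

    // Hypothetical field-aware tf: log damping for "content" only,
    // sqrt damping for every other field.
    static float tf(String field, float freq) {
        return "content".equals(field) ? logTf(freq) : sqrtTf(freq);
    }

    public static void main(String[] args) {
        // Print both curves side by side for a few term frequencies.
        for (float freq : new float[] {1f, 10f, 100f, 1000f}) {
            System.out.printf("freq=%6.0f  sqrt=%7.2f  log=%7.2f%n",
                    freq, sqrtTf(freq), logTf(freq));
        }
    }
}
```

With these definitions a term repeated 100 times in the content field contributes about 5.6 instead of 10, so heavy repetition is rewarded much less at high frequencies and the boost has more room to influence the ranking.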
To conclude, I will add IndexSorter.java to the core classes, and I suggest continuing the experiments ...
I've updated the version of Lucene included with Nutch to have the required patch. Would you like me to commit IndexSorter.java, or would you rather commit it yourself?
Doug
