Re: IndexSorter optimizer

Andrzej Bialecki Mon, 02 Jan 2006 14:49:48 -0800

Doug Cutting wrote:

Andrzej Bialecki wrote:
Using the original index, it was possible for pages with high tf/idf of a term, but with a low "boost" value (the OPIC score), to outrank pages with high "boost" but lower tf/idf of a term. This phenomenon leads quite often to results that are perceived as "junk", e.g. pages with a lot of repeated terms, but with little other real content, like for example navigation bars.
Sounds like tf/idf might be de-emphasized in scoring. Perhaps NutchSimilarity.tf() should use log() instead of sqrt() when field==content?

I don't think it's that simple: the OPIC score is what determined this behaviour, and it doesn't correspond at all to tf/idf, but to a human judgement.
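
For reference, here is a rough sketch of how much more strongly a logarithm damps repeated terms than a square root does. It is not taken from NutchSimilarity: the class TfDamping, both method names, and the 1 + ln(freq) form of the logarithmic variant are assumptions made only for illustration.

```java
public class TfDamping {

    // Square-root damping of the raw term frequency.
    public static float sqrtTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Hypothetical logarithmic damping. The 1 + ln(freq) form is an
    // assumption for this sketch, not something specified in the thread.
    public static float logTf(float freq) {
        return freq > 0 ? (float) (1.0 + Math.log(freq)) : 0.0f;
    }

    public static void main(String[] args) {
        // Print both factors side by side for increasingly repetitive pages.
        float[] freqs = {1f, 10f, 100f, 1000f};
        for (float f : freqs) {
            System.out.printf("freq=%6.0f  sqrt=%7.2f  log=%5.2f%n",
                    f, sqrtTf(f), logTf(f));
        }
    }
}
```

With either function a page that repeats a term many times still gains over one that mentions it once; the logarithm only shrinks that gain, which fits the point that the boost value also has to be taken into account.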

To conclude, I will add the IndexSorter.java to the core classes, and I suggest to continue the experiments ...
I've updated the version of Lucene included with Nutch to have the required patch. Would you like me to commit IndexSorter.java or would you?

Please do it. There are two typos in your version of IndexSorter: you used numDocs() in two places instead of maxDoc(), which for indexes with deleted docs (after dedup) leads to exceptions.
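
To illustrate the difference, here is a minimal, self-contained sketch. It does not use the Lucene classes and is not the IndexSorter code: ToyIndex is a hypothetical stand-in that only mimics the three reader methods involved (maxDoc(), numDocs(), isDeleted()). In Lucene, maxDoc() is one greater than the largest document number, while numDocs() does not count deleted documents, so the two differ as soon as anything has been deleted.

```java
public class DocCountSketch {

    // Hypothetical stand-in for an index reader: one flag per document slot.
    public static final class ToyIndex {
        private final boolean[] deleted;

        public ToyIndex(boolean[] deleted) {
            this.deleted = deleted;
        }

        // One greater than the largest document number, deleted or not.
        public int maxDoc() {
            return deleted.length;
        }

        // Number of live (non-deleted) documents.
        public int numDocs() {
            int live = 0;
            for (boolean d : deleted) {
                if (!d) live++;
            }
            return live;
        }

        public boolean isDeleted(int doc) {
            return deleted[doc];
        }
    }

    // Collects the numbers of all live documents. The loop bound must be
    // maxDoc(); with numDocs() the highest-numbered documents are never
    // visited once the index contains deletions.
    public static int[] liveDocIds(ToyIndex index) {
        int[] ids = new int[index.numDocs()];
        int next = 0;
        for (int doc = 0; doc < index.maxDoc(); doc++) {
            if (!index.isDeleted(doc)) {
                ids[next++] = doc;
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Documents 1 and 3 were removed, e.g. by dedup.
        ToyIndex index = new ToyIndex(new boolean[] {false, true, false, true, false});
        System.out.println("maxDoc=" + index.maxDoc() + " numDocs=" + index.numDocs());

        // An array indexed by document number has to be sized by maxDoc().
        float[] perDoc = new float[index.numDocs()];
        try {
            perDoc[4] = 1.0f; // document 4 is live, but 4 >= numDocs()
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("array sized by numDocs() is too small");
        }
    }
}
```

On an index without deletions both values are equal, which is why this kind of mistake only shows up after dedup.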


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
