Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> Using the original index, it was possible for pages with high tf/idf
>> of a term, but with a low "boost" value (the OPIC score), to outrank
>> pages with high "boost" but lower tf/idf of a term. This phenomenon
>> quite often leads to results that are perceived as "junk", e.g. pages
>> with a lot of repeated terms but little other real content, such as
>> navigation bars.
> Sounds like tf/idf might be de-emphasized in scoring. Perhaps
> NutchSimilarity.tf() should use log() instead of sqrt() when
> field==content?
I don't think it's that simple: the OPIC score is what determined this
behaviour, and it doesn't correspond to tf/idf at all, but to a human
judgement.
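
For reference, the log() versus sqrt() idea quoted above can be sketched roughly as follows. This is only an illustration, not the actual NutchSimilarity code: the class name TfDampingSketch, both method names, and the "1 +" offset in the log variant are assumptions made for the example. It shows how much more strongly a logarithm flattens large term frequencies than a square root does.

```java
// Illustrative sketch only; this is NOT the NutchSimilarity implementation.
// Class and method names are hypothetical.
final class TfDampingSketch {

    // Square-root damping, the shape referred to as sqrt() in the thread.
    static float sqrtTf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Hypothetical log damping. The "1 +" offset is an assumption made here
    // so that a single occurrence scores 1 instead of 0.
    static float logTf(float freq) {
        return freq > 0 ? 1.0f + (float) Math.log(freq) : 0.0f;
    }

    public static void main(String[] args) {
        // Print both curves for a few term frequencies to compare their growth.
        for (float freq : new float[] {1f, 10f, 100f, 1000f}) {
            System.out.printf("freq=%.0f sqrt=%.2f log=%.2f%n",
                    freq, sqrtTf(freq), logTf(freq));
        }
    }
}
```

Under either curve the boost still enters the score separately, so a change like this would only reduce the weight of repeated terms; it would not make the ranking follow the OPIC score.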
>> To conclude, I will add the IndexSorter.java to the core classes, and
>> I suggest continuing the experiments ...
> I've updated the version of Lucene included with Nutch to have the
> required patch. Would you like me to commit IndexSorter.java or would
> you?
Please do it. There are two typos in your version of IndexSorter: you
used numDocs() in two places instead of maxDoc(), which leads to
exceptions for indexes with deleted docs (after dedup).
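
To show why the two calls differ once documents have been deleted, here is a small self-contained sketch. ToyIndex is a hypothetical stand-in written for this example, not Lucene's IndexReader and not the IndexSorter code; it only mimics the relevant semantics, where maxDoc() is one more than the largest document number and numDocs() counts live documents only. The exact failure in IndexSorter may differ in detail, but an array sized by numDocs() and indexed by document number fails the same way.

```java
import java.util.BitSet;

// Hypothetical toy index used only to illustrate numDocs() vs maxDoc().
final class ToyIndex {
    private final float[] boosts;              // one slot per doc number, deleted or not
    private final BitSet deleted = new BitSet();

    ToyIndex(float[] boosts) { this.boosts = boosts.clone(); }

    void delete(int doc) { deleted.set(doc); }
    boolean isDeleted(int doc) { return deleted.get(doc); }

    // One greater than the largest document number; deletions do not change it.
    int maxDoc() { return boosts.length; }

    // Number of live (non-deleted) documents.
    int numDocs() { return boosts.length - deleted.cardinality(); }

    float boost(int doc) { return boosts[doc]; }

    // Copies boosts of live docs into an array indexed by doc number.
    // The array must have maxDoc() slots; numDocs() is too small after deletions.
    static float[] copyBoosts(ToyIndex index, int size) {
        float[] out = new float[size];
        for (int doc = 0; doc < index.maxDoc(); doc++) {
            if (!index.isDeleted(doc)) {
                out[doc] = index.boost(doc);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex(new float[] {1f, 2f, 3f, 4f, 5f});
        index.delete(1);
        index.delete(2);
        copyBoosts(index, index.maxDoc());      // fine: 5 slots for doc numbers 0..4
        try {
            copyBoosts(index, index.numDocs()); // 3 slots, but doc 3 is still live
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("numDocs() is too small once docs are deleted");
        }
    }
}
```

On an index without deletions the two values are equal, which is why this kind of slip only shows up after dedup.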
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com