Insurance Squared Inc. wrote:
I'm trying to get rid of some spammy sites in our index.
First, I wonder if anyone has any suggestions on changes to the
default install config of Nutch that will help drive better sites to
the top and spammier sites down.
What is a "better" site? The more precisely you can define that, the
clearer it becomes which changes would actually improve the quality
of your results.
Secondly, I boosted the inbound anchor text config - but if anything
that made things worse. A lot of the spammier sites heavily use
search terms in their internal anchors. So I'm wondering - is there
any easy way to distinguish between anchor text from within the same
domain vs. anchor text from external domains, and give them different
weightings? I expect this isn't the case currently - anyone have any
opinions on how difficult this would be to change?
The scoring API (just committed) gives you this option. Please see
the indexerScore method of ScoringFilter.
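
To illustrate the kind of internal-vs-external distinction you are asking
about, here is a rough standalone sketch. It does not use any Nutch classes:
the class name AnchorWeights, the method names and the two weight values are
all made up for this example, and it compares full host names rather than
registered domains (so www.example.com and example.com count as different
hosts). Logic along these lines could be called from a scoring plugin, but
how you wire it in is up to you.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: picks a weight for an anchor
// depending on whether the link comes from the same host as its target.
public class AnchorWeights {
    // Illustrative values only; tune them against your own index.
    public static final float INTERNAL_WEIGHT = 0.2f;
    public static final float EXTERNAL_WEIGHT = 1.0f;

    // Returns the lower-cased host of a URL, or "" if it cannot be parsed.
    public static String host(String url) {
        try {
            String h = new URI(url).getHost();
            return h == null ? "" : h.toLowerCase(Locale.ROOT);
        } catch (URISyntaxException e) {
            return "";
        }
    }

    // True only when both URLs parse and share exactly the same host.
    public static boolean sameHost(String fromUrl, String toUrl) {
        String from = host(fromUrl);
        String to = host(toUrl);
        return !from.isEmpty() && from.equals(to);
    }

    // Down-weights anchors from the same host, keeps full weight otherwise.
    public static float anchorWeight(String fromUrl, String toUrl) {
        return sameHost(fromUrl, toUrl) ? INTERNAL_WEIGHT : EXTERNAL_WEIGHT;
    }
}
```

Unparseable URLs are treated as external here, which is just one possible
choice; you may prefer to drop such anchors altogether.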
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com