Insurance Squared Inc. wrote:
I'm trying to get rid of some spammy sites in our index.
First, I wonder if anyone has any suggestions on changes to the default install config of Nutch that will help drive better sites to the top and spammier sites down.

What is a "better" site? Depending on how you define this, and how precise your definition is, you should get clear indications of how to improve the quality.


Secondly, I boosted the inbound anchor text config - but if anything that made things worse. A lot of the spammier sites heavily use search terms in their internal anchors. So I'm wondering - is there any easy way to distinguish between anchor text from within the same domain vs. anchor text from external domains, and give them different weightings? I expect this isn't the case currently - anyone have any opinions on how difficult this would be to change?

The scoring API (just committed) gives you this option. Please see ScoringFilter's method indexerScore.
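
As a rough illustration of the idea only (this is not Nutch code and does not use the ScoringFilter interface), the sketch below weights each inlink by whether it comes from the same host as the target page. The class name, method names and the two weight values are hypothetical, and comparing hosts is a simplification of comparing domains: www.example.com and example.com count as different here. Logic of this kind could be applied wherever you compute the score from a page's inlinks.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: weights inlinks by origin.
final class AnchorWeighting {
    // Assumed values; tune them against your own index.
    static final float INTERNAL_WEIGHT = 0.2f; // links from the same host
    static final float EXTERNAL_WEIGHT = 1.0f; // links from other hosts

    // Returns the lower-cased host of a URL, or "" if it has none.
    static String host(String url) {
        String h = URI.create(url).getHost();
        return h == null ? "" : h.toLowerCase();
    }

    // True when both URLs have the same non-empty host.
    static boolean sameHost(String a, String b) {
        String ha = host(a);
        return !ha.isEmpty() && ha.equals(host(b));
    }

    // Sums one weight per inlink: small for internal, full for external.
    static float anchorBoost(String targetUrl, List<String> inlinkSourceUrls) {
        float boost = 0f;
        for (String from : inlinkSourceUrls) {
            boost += sameHost(targetUrl, from) ? INTERNAL_WEIGHT : EXTERNAL_WEIGHT;
        }
        return boost;
    }

    public static void main(String[] args) {
        List<String> inlinks = Arrays.asList(
            "http://example.com/about",   // internal
            "http://example.com/contact", // internal
            "http://other.org/links");    // external
        System.out.println(anchorBoost("http://example.com/", inlinks));
    }
}
```

With weights like these, a page that links to itself many times with keyword-heavy anchors gains far less than a page with a few links from other hosts.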

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

