I'm trying to get rid of some spammy sites in our index. First, I wonder if anyone has any suggestions on changes to the default install config of Nutch that will help drive better sites to the top and spammier sites down.

Secondly, I boosted the inbound anchor text config, but if anything that made things worse. A lot of the spammier sites heavily use search terms in their internal anchors. So I'm wondering: is there any easy way to distinguish between anchor text from within the same domain and anchor text from external domains, and give them different weightings? I expect this isn't supported currently. Does anyone have an opinion on how difficult it would be to change?
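To make the idea concrete, here is a rough sketch of the distinction I have in mind. It is not Nutch code: the class AnchorWeighting, its method names, and the two weight values are all made up for illustration, and it compares full host names rather than registered domains, so www.example.com and example.com would count as different sites.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: weights an inbound anchor by
// whether the linking page is on the same host as the target page.
final class AnchorWeighting {
    // Assumed example weights, chosen only for illustration.
    static final float INTERNAL_WEIGHT = 0.2f;
    static final float EXTERNAL_WEIGHT = 1.0f;

    // Returns the lower-cased host of a URL, or "" if it cannot be parsed.
    static String host(String url) {
        try {
            String h = new URI(url).getHost();
            return h == null ? "" : h.toLowerCase();
        } catch (URISyntaxException e) {
            return "";
        }
    }

    // An anchor is internal when both URLs have the same non-empty host.
    static boolean isInternal(String fromUrl, String toUrl) {
        String from = host(fromUrl);
        return !from.isEmpty() && from.equals(host(toUrl));
    }

    // Weight to apply to anchor text on a link from fromUrl to toUrl.
    static float weight(String fromUrl, String toUrl) {
        return isInternal(fromUrl, toUrl) ? INTERNAL_WEIGHT : EXTERNAL_WEIGHT;
    }

    public static void main(String[] args) {
        // Same host: treated as an internal anchor.
        System.out.println(weight("http://spam.example/a", "http://spam.example/b"));
        // Different hosts: treated as an external anchor.
        System.out.println(weight("http://other.example/", "http://spam.example/b"));
    }
}
```

Where a weight like this would hook into indexing or scoring is exactly the part I'm unsure about.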

Thanks,
g.
