We've managed to dig ourselves into a couple of link farms with tens of
thousands of sub-domains.
I didn't notice until they blocked our DNS requests and the Nutch error
rates shot way up.
Are there any methods for detecting these things (say, any domain with
more than 100 sub-domains), or is there a master list somewhere that we
could filter against?
I've read a paper on detecting link farms, but from what I remember,
it wasn't a slam-dunk to implement.
So far we've relied on detecting these manually, then pruning their
URLs from the crawldb and adding the domains to the regex-urlfilter file.
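
To make the "more than 100 sub-domains" idea concrete, here is a rough
sketch of the check, not something we actually run. It assumes a plain
list of URLs, one per entry, however you get them out of the crawldb.
The function names, the threshold default, and the "last two labels"
rule for the parent domain are all my own assumptions; that rule is
naive and will mis-group hosts under suffixes like co.uk. The output
line uses the usual regex-urlfilter convention of a leading '-' for
exclusion.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def parent_domain(host):
    """Naive assumption: the parent domain is the last two labels."""
    labels = host.split(".")
    return ".".join(labels[-2:])


def find_suspect_domains(urls, threshold=100):
    """Return {parent_domain: distinct_host_count} for domains whose
    number of distinct hosts exceeds the (hypothetical) threshold."""
    hosts_by_domain = defaultdict(set)
    for url in urls:
        try:
            host = urlsplit(url).hostname  # lowercased, or None
        except ValueError:
            continue  # skip URLs that cannot be parsed
        if not host:
            continue
        host = host.rstrip(".")
        hosts_by_domain[parent_domain(host)].add(host)
    return {
        domain: len(hosts)
        for domain, hosts in hosts_by_domain.items()
        if len(hosts) > threshold
    }


def exclusion_rule(domain):
    """Build one exclusion line for the regex-urlfilter file that
    covers the domain and every sub-domain under it."""
    escaped = domain.replace(".", r"\.")
    return r"-^https?://([a-z0-9-]+\.)*" + escaped + "[:/]"


if __name__ == "__main__":
    # Made-up sample data: 101 hosts under one hypothetical domain.
    sample = ["http://s%d.farm.example/page" % i for i in range(101)]
    sample += ["http://www.ok.org/", "http://a.ok.org/"]
    for domain in sorted(find_suspect_domains(sample)):
        print(exclusion_rule(domain))
```

The generated lines would have to go above any catch-all '+' rule in
the filter file, since the first matching pattern wins. A raw host
count will also flag legitimate large hosting and blog sites, so the
output is probably best treated as a list for manual review.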
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"