Hi,

is the content of the pages 'mostly' identical?
Since we can now provide custom hash implementations to the crawlDB, what do people think about locality-sensitive hashing?

http://citeseer.ist.psu.edu/haveliwala00scalable.html

As far as I understand the paper, we could implement the hashing so that 'similar' pages (ones that differ by just one word) are treated as the same page. My experience with link farms is that the pages are identical except for one number or word or date or something like that. In such a case LSH could be an interesting way to attack the problem.
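To make the idea concrete, here is a rough sketch (not Nutch code; the class and method names are made up for illustration) of a min-hash style signature over word shingles, so that two pages differing by a single word agree on most signature positions:

import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Minimal min-hash sketch over word shingles: pages that differ in only a
// few words end up with mostly identical signatures, so they can be
// bucketed together instead of being treated as distinct content.
public class MinHashSignature {

    private final int numHashes;
    private final int[] seeds;

    public MinHashSignature(int numHashes, long randomSeed) {
        this.numHashes = numHashes;
        this.seeds = new int[numHashes];
        Random rnd = new Random(randomSeed);
        for (int i = 0; i < numHashes; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    // Build word shingles (here: 3-word windows) from the page text.
    private Set<String> shingles(String text) {
        String[] words = text.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 2 < words.length; i++) {
            result.add(words[i] + " " + words[i + 1] + " " + words[i + 2]);
        }
        return result;
    }

    // For each hash function keep the minimum hash over all shingles.
    public int[] signature(String text) {
        int[] sig = new int[numHashes];
        java.util.Arrays.fill(sig, Integer.MAX_VALUE);
        for (String s : shingles(text)) {
            for (int i = 0; i < numHashes; i++) {
                int h = s.hashCode() ^ seeds[i];
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    // Fraction of matching positions approximates the Jaccard similarity
    // of the shingle sets; values near 1.0 mean "almost the same page".
    public static double similarity(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i] == b[i]) same++;
        }
        return (double) same / a.length;
    }

    public static void main(String[] args) {
        MinHashSignature mh = new MinHashSignature(64, 42L);
        int[] s1 = mh.signature("cheap widgets page number 1 buy now");
        int[] s2 = mh.signature("cheap widgets page number 2 buy now");
        System.out.println("similarity = " + similarity(s1, s2));
    }
}

Signatures like this could then be banded into buckets so that the near-identical pages of a farm collapse into one crawlDB entry instead of thousands.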

Any thoughts?

Stefan


On 07.03.2006 at 22:38, Ken Krugler wrote:

We've managed to dig ourselves into a couple of link farms with tens of
thousands of sub-domains.

I didn't notice until they blocked our DNS requests and the Nutch error
rates shot way up.

Are there any methods for detecting these things (more than 100
sub-domains) or a master list somewhere that we can filter?

I've read a paper on detecting link farms, but from what I remember, it wasn't a slam-dunk to implement.

So far we've relied on manually detecting these, and then pruning the results from the crawldb and adding them to the regex-urlfilter file.
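For illustration, a blocking entry in regex-urlfilter.txt for a farm and all of its sub-domains looks something like this (the domain name is made up):

-^http://([a-z0-9-]+\.)*somelinkfarm\.com/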

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com

