Hi,
is the content of the pages 'mostly' identical?
Since we can now provide custom hash implementations to the crawlDB,
what do people think about locality-sensitive hashing?
http://citeseer.ist.psu.edu/haveliwala00scalable.html
As far as I understand the paper, we could implement the hashing in a
way that lets us treat 'similar' pages (e.g. differing by just one
word) as one.
My experience with link farms is that the pages are identical except
for one number, one word, or some other small piece of data.
In such cases, LSH could be an interesting way to tackle the problem.
Any thoughts?
Stefan
On 07.03.2006, at 22:38, Ken Krugler wrote:
We've managed to dig ourselves into a couple of link farms with tens
of thousands of sub-domains.
I didn't notice until they blocked our DNS requests and the Nutch
error rates shot way up.
Are there any methods for detecting these things (more than 100
sub-domains), or a master list somewhere that we can filter against?
I've read a paper on detecting link farms, but from what I remember,
it wasn't a slam-dunk to implement.
So far we've relied on manually detecting these, and then pruning
the results from the crawldb and adding them to the regex-urlfilter
file.
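In case it helps to make the "more than 100 sub-domains" heuristic
concrete, here is a rough sketch of it: take the host names seen in
the crawl (e.g. dumped from the crawldb), count distinct sub-domains
per parent domain, and report the suspicious ones. The threshold and
the host list in main() are only placeholders, and the naive
"last two labels" rule mis-handles suffixes like .co.uk; a real
implementation would use a public-suffix list.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Flags parent domains that have an unusually large number of sub-domains. */
public class SubDomainCounter {

  /** Groups hosts by a crude "parent domain" (last two labels). */
  public static Map<String, Set<String>> groupByParent(Iterable<String> hosts) {
    Map<String, Set<String>> byParent = new HashMap<>();
    for (String host : hosts) {
      String[] labels = host.toLowerCase().split("\\.");
      if (labels.length < 3) continue;  // no sub-domain at all
      String parent = labels[labels.length - 2] + "." + labels[labels.length - 1];
      byParent.computeIfAbsent(parent, k -> new HashSet<>()).add(host);
    }
    return byParent;
  }

  public static void main(String[] args) {
    // In practice the host list would come from a crawldb dump.
    Iterable<String> hosts = java.util.List.of(
        "a1.spamfarm.example", "a2.spamfarm.example", "www.normal-site.example");
    groupByParent(hosts).forEach((parent, subs) -> {
      if (subs.size() > 100) {  // threshold mentioned in the thread
        System.out.println(parent + " has " + subs.size() + " sub-domains");
      }
    });
  }
}

The domains flagged this way could then be pruned from the crawldb and
added to the regex-urlfilter file, as described above.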
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com