Yes, we use several deduplication mechanisms and they work fine. The problem 
is that they waste a lot of CPU cycles for nothing. Why not stop those unwanted 
URLs from entering the CrawlDB in the first place instead of getting rid of 
them afterwards? 

Growth of the CrawlDB is very significant, especially with thousands of long 
URLs.
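To illustrate why the repeating-segment regex cannot stop this trap, here is a small sketch. It assumes Python's urljoin resolves the broken relative links the same way a crawler would, and it models the filter on the repeating-segment rule that ships commented out in Nutch's conf/regex-urlfilter.txt; treat it as an illustration, not the exact Nutch code path:

```python
import re
from urllib.parse import urljoin

# Rule modelled on the repeating-segment pattern shipped (commented out) in
# Nutch's conf/regex-urlfilter.txt: the same segment occurring three times
# with exactly one other segment in between.
REPEATING = re.compile(r"(/[^/]+)/[^/]+\1/[^/]+\1/")

# Hypothetical reproduction of the trap: every page emits the same relative
# outlink, which gets resolved against the ever-longer URL of the current page.
url = "http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/"
for hop in range(5):
    url = urljoin(url, "voorstellingen/item/2/")
    # The repeating cycle here is three segments long, so the one-segment-apart
    # pattern never fires and the URL keeps growing unchecked.
    print(hop, bool(REPEATING.search(url)), len(url))
```

Because the cycle is `voorstellingen/item/2/` (three segments), the rule never matches, while a genuine `/dir/a/dir/b/dir/` loop would be caught.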

On Tuesday 13 September 2011 12:54:21 Dinçer Kavraal wrote:
> Hi Markus,
> 
> Please correct me if I'm wrong, but isn't there a document signature check
> to detect whether a page contains the same content as another page that has
> already been parsed and indexed?
> 
> Dinçer
> 
> 2011/9/12 Markus Jelsma <[email protected]>
> 
> > Hi,
> > 
> > Would it not be a good idea to patch DomContentUtils with an option not
> > to consider relative outlinks without a base URL? This example [1] will
> > currently take over the CrawlDB very quickly and produce countless unique
> > URLs that cannot be filtered out with the regex that detects repeating
> > URI segments.
> > 
> > There are many websites on the internet that suffer from this problem.
> > 
> > A patch would protect against this common crawler trap, but not against
> > incorrect absolute URLs - ones that are supposed to be absolute but have,
> > for example, an incorrect protocol scheme.
> > 
> > [1]:
> > http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
> > 
> > Cheers,
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
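
The option proposed above could look roughly like this sketch. It is Python rather than Nutch's Java, and the function and parameter names are hypothetical - the real DomContentUtils walks DOM nodes, so this only shows the guard itself:

```python
from urllib.parse import urljoin, urlparse

def collect_outlinks(page_url, hrefs, base_href=None,
                     ignore_relative_without_base=True):
    """Hypothetical sketch of the proposed option: drop relative outlinks
    unless the page declares an explicit base URL (<base href=...>)."""
    outlinks = []
    for href in hrefs:
        # Treat a link as "relative" when it has no scheme and is not
        # root-relative; those are the ones that can grow without bound.
        is_relative = urlparse(href).scheme == "" and not href.startswith("/")
        if is_relative and base_href is None and ignore_relative_without_base:
            continue  # skip the link instead of resolving it against page_url
        outlinks.append(urljoin(base_href or page_url, href))
    return outlinks

# The trap link is dropped; absolute and root-relative links survive.
print(collect_outlinks(
    "http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/",
    ["voorstellingen/item/2/", "/contact", "http://example.org/x"]))
```

With a `base_href` supplied, relative links are kept and resolved against it, which matches the "without a base url" condition in the proposal.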

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
