Hi Markus,

Please correct me if I'm wrong, but isn't there a document signature check
that detects whether a page has the same content as another page that has
already been parsed and indexed?
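
To make clear what I mean by a signature check, here is a rough Python sketch
of the general idea. It is not Nutch's code: the function names, the MD5
hash, and the whitespace/case normalization are all assumptions chosen only
for illustration.

```python
import hashlib


def content_signature(text):
    """Return a hash of the text after lowercasing and collapsing whitespace.

    The normalization and the choice of MD5 are illustrative assumptions.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Map each duplicate URL to the first URL seen with the same signature.

    `pages` is a dict of URL -> parsed text, processed in insertion order.
    """
    first_seen = {}   # signature -> first URL that produced it
    duplicates = {}   # duplicate URL -> URL it duplicates
    for url, text in pages.items():
        signature = content_signature(text)
        if signature in first_seen:
            duplicates[url] = first_seen[signature]
        else:
            first_seen[signature] = url
    return duplicates
```

Note that a check like this can only run after a page has been fetched and
parsed, so each generated URL still has to be fetched once before it can be
recognized as a duplicate.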

Dinçer

2011/9/12 Markus Jelsma <[email protected]>

> Hi,
>
> Would it not be a good idea to patch DomContentUtils with an option not to
> consider relative outlinks without a base URL? This example [1] will
> currently quickly take over the crawl db and produce countless unique URLs
> that cannot be filtered out with the regex that detects repeating URI
> segments.
>
> There are many websites on the internet that suffer from this problem.
>
> A patch would protect against this common crawler trap but not against
> incorrect absolute URLs, i.e. ones that are supposed to be absolute but, for
> example, have an incorrect protocol scheme.
>
> [1]:
> http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
>
> Cheers,
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
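
The option proposed in the quoted message could look roughly like the
following Python sketch. It is not a patch for DomContentUtils: the class,
the function names, and the `skip_relative_without_base` flag are
hypothetical, and "base URL" is assumed here to mean an explicit
`<base href>` element in the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class OutlinkCollector(HTMLParser):
    """Collect raw href values and the first <base href>, if any."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        if tag == "base" and self.base_href is None:
            self.base_href = href
        elif tag == "a":
            self.hrefs.append(href)


def is_absolute(href):
    """Treat a link as absolute only if it has both a scheme and a host."""
    parsed = urlparse(href)
    return bool(parsed.scheme and parsed.netloc)


def extract_outlinks(html, page_url, skip_relative_without_base=True):
    """Return outlinks, optionally dropping relative links on pages
    that declare no <base href> (the hypothetical option)."""
    collector = OutlinkCollector()
    collector.feed(html)
    base = urljoin(page_url, collector.base_href) if collector.base_href else None

    outlinks = []
    for href in collector.hrefs:
        if is_absolute(href):
            outlinks.append(href)
        elif base is not None:
            outlinks.append(urljoin(base, href))
        elif not skip_relative_without_base:
            # Default behaviour without the option: resolve against the page URL.
            outlinks.append(urljoin(page_url, href))
    return outlinks
```

With the flag off, a relative link such as `voorstellingen/archief/` on a
page whose URL ends in `/item/1/` resolves to a new, deeper URL on every hop,
which is the trap described above. With the flag on, that link is dropped. A
link with a mistyped scheme still counts as absolute in this sketch, so it is
not caught, in line with the caveat in the quoted message.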
