Hi,

Would it not be a good idea to patch DomContentUtils with an option not to 
consider relative outlinks when there is no base URL? This example [1] 
currently takes over the crawl db quickly and produces countless unique URLs 
that cannot be filtered out with the regex that detects repeating URI segments.

There are many websites on the internet that suffer from this problem.

A patch would protect against this common crawler trap, but not against 
incorrect absolute URLs, i.e. ones that are supposed to be absolute but, for 
example, have an incorrect protocol scheme.
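
To make the idea concrete, here is a minimal sketch in plain Java. It is not 
actual Nutch code: the class name, the method names and the 
skipRelativeWithoutBase flag are all hypothetical, and example.org stands in 
for a real site. It assumes the option simply drops every outlink that has no 
scheme of its own when the page supplies no base URL.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Nutch.
public class RelativeOutlinkSketch {

    // True if the href carries its own scheme, i.e. is an absolute URI.
    static boolean isAbsolute(String href) {
        try {
            return URI.create(href).isAbsolute();
        } catch (IllegalArgumentException e) {
            return false; // unparsable hrefs are treated as not absolute
        }
    }

    // Resolves hrefs against baseHref if given, otherwise against pageUrl.
    // With skipRelativeWithoutBase set and no baseHref, relative hrefs are dropped.
    static List<String> resolveOutlinks(String pageUrl, String baseHref,
                                        List<String> hrefs,
                                        boolean skipRelativeWithoutBase) {
        List<String> outlinks = new ArrayList<>();
        URI base = URI.create(baseHref != null ? baseHref : pageUrl);
        for (String href : hrefs) {
            if (baseHref == null && skipRelativeWithoutBase && !isAbsolute(href)) {
                continue; // the proposed option: ignore this outlink
            }
            outlinks.add(base.resolve(href).toString());
        }
        return outlinks;
    }

    public static void main(String[] args) {
        String page = "http://example.org/archive/item/1/";
        List<String> hrefs = List.of("archive/item/1/", "htp://example.org/x");
        // Without the option the relative link is appended to the page path.
        System.out.println(resolveOutlinks(page, null, hrefs, false));
        // With the option only the (malformed) absolute link remains.
        System.out.println(resolveOutlinks(page, null, hrefs, true));
    }
}
```

Without the option, the relative link resolves to a longer URL on every hop. 
With it, the relative link is dropped, while the absolute link with the broken 
scheme still gets through, which is the limitation mentioned above.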

[1]: http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/

Cheers,
-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
