Hi,

Would it not be a good idea to patch DomContentUtils with an option to ignore relative outlinks that have no base URL? The example page [1] currently takes over the crawl db quickly, producing countless unique URLs that cannot be filtered out with the regex that detects repeating URI segments.
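
To make the trap concrete, here is a minimal sketch in plain Java. It is not Nutch code: the class `RelativeOutlinkSketch`, the helper `resolveOutlinks`, the flag `skipRelativeWithoutBase`, and the example.com URLs are all hypothetical. It also assumes that "base URL" means a `<base href>` declared by the page, and that any href without a scheme counts as relative.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RelativeOutlinkSketch {

    // Hypothetical helper, not part of Nutch. baseHref is the value of the
    // page's <base href>, or null when the page declares none.
    static List<String> resolveOutlinks(String pageUrl, String baseHref,
                                        List<String> hrefs,
                                        boolean skipRelativeWithoutBase) {
        // Without a declared base, links resolve against the page URL itself.
        URI base = URI.create(baseHref != null ? baseHref : pageUrl);
        List<String> outlinks = new ArrayList<>();
        for (String href : hrefs) {
            URI link = URI.create(href);
            // Assumption: "relative" means the href has no scheme.
            boolean relative = !link.isAbsolute();
            if (relative && baseHref == null && skipRelativeWithoutBase) {
                continue; // the proposed option: drop this outlink
            }
            outlinks.add(base.resolve(link).toString());
        }
        return outlinks;
    }

    public static void main(String[] args) {
        // A page that links to itself with a relative path (made-up URLs).
        List<String> hrefs = Arrays.asList("archive/item/1/");
        String page = "http://example.com/archive/";

        // Option off: every fetch yields a new, longer, unique URL.
        for (int depth = 0; depth < 3; depth++) {
            page = resolveOutlinks(page, null, hrefs, false).get(0);
            System.out.println(page);
        }

        // Option on: the relative outlink is dropped, so the list is empty.
        System.out.println(
            resolveOutlinks("http://example.com/archive/", null, hrefs, true));
    }
}
```

With the option off, each round appends `archive/item/1/` to the previous URL, so the crawl never runs out of new URLs. With the option on, the relative link is dropped because the page declares no base.
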
Many websites suffer from this problem. Such a patch would protect against this common crawler trap, but not against incorrect absolute URLs, for example a URL that is supposed to be absolute but has an incorrect protocol scheme.

[1]: http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/

Cheers,

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

