Hi Markus,

Please correct me if I'm wrong, but isn't there a document signature check that detects whether a page has the same content as another page that has already been parsed and indexed?
Dinçer

2011/9/12 Markus Jelsma <[email protected]>

> Hi,
>
> Would it not be a good idea to patch DomContentUtils with an option not to
> consider relative outlinks without a base url? This example [1] will
> currently quickly take over the crawl db and produce countless unique
> URL's that cannot be filtered out with the regex that detects repeating
> URI segments.
>
> There are many websites on the internet that suffer from this problem.
>
> A patch would protect this common crawler trap but not against incorrect
> absolute URL's - one that is supposed to be absolute but for example has
> an incorrect protocol scheme.
>
> [1]: http://www.hollandopera.nl/voorstellingen/archief/voorstellingen/item/1/
>
> Cheers,
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
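
To make the trap in the quoted message concrete, here is a minimal Python sketch, not Nutch code. The function `resolve_outlink`, its `skip_relative_without_base` flag, and the `example.com` URLs are all hypothetical names chosen for illustration. It assumes a site that serves the same page, with the same relative link and no base URL, under every path. For simplicity it treats any link without a scheme as relative.

```python
from urllib.parse import urljoin, urlparse


def resolve_outlink(page_url, href, base_href=None,
                    skip_relative_without_base=False):
    """Return the absolute outlink, or None if the link is skipped.

    Hypothetical helper: any href without a scheme counts as relative.
    """
    if urlparse(href).scheme:
        # Already absolute; kept as-is (a wrong scheme is NOT caught here).
        return href
    if base_href is None and skip_relative_without_base:
        # The proposed option: ignore relative links when no base URL is given.
        return None
    return urljoin(base_href or page_url, href)


# Assumed trap: every page carries the same relative link and no base URL.
href = "archive/item/1/"
url = "http://example.com/archive/"
seen = []
for _ in range(3):
    url = resolve_outlink(url, href)
    seen.append(url)

for u in seen:
    print(u)
```

Each round yields a longer, previously unseen URL, which is how such a site can fill the crawl db. With `skip_relative_without_base=True` the same link is dropped, while an absolute link still passes through unchanged, including one with an incorrect scheme.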

