Hi again,

Regarding disabling filters: I just checked my nutch-default.xml and nutch-site.xml files. Neither contains any reference to crawl.generate, which (going by http://wiki.apache.org/nutch/bin/nutch_generate) seems to suggest that URLs should be filtered.
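For reference, the check described above (looking for any property whose name mentions crawl.generate in the two config files) can be sketched as below. This is only an illustration: the helper name `find_properties` and the sample XML are made up for the example, and it assumes the usual `<configuration><property><name>/<value>` layout of these files.

```python
import xml.etree.ElementTree as ET


def find_properties(xml_text, needle):
    """Return (name, value) pairs for properties whose name contains `needle`."""
    root = ET.fromstring(xml_text)
    matches = []
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if needle in name:
            matches.append((name, value))
    return matches


# Hypothetical fragment for illustration only; not copied from a real Nutch config.
SAMPLE = """<configuration>
  <property><name>example.agent.name</name><value>my-crawler</value></property>
  <property><name>example.generate.flag</name><value>true</value></property>
</configuration>"""

if __name__ == "__main__":
    print(find_properties(SAMPLE, "generate"))
    print(find_properties(SAMPLE, "crawl.generate"))
```

To run it against real files, read each file's text (for example the nutch-default.xml and nutch-site.xml mentioned above) and pass it in place of `SAMPLE`. An empty result only shows that the property is not set in that file; it says nothing about options passed on the command line.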
Ian.

--

On 30 Jul 2012, at 19:06, Markus Jelsma wrote:

> Hi,
>
> Either your regex is wrong, you haven't updated the CrawlDB with the new
> filters and/or you disabled filtering in the Generator.
>
> Cheers
>
> -----Original message-----
>> From:Ian Piper <ianpi...@tellura.co.uk>
>> Sent: Mon 30-Jul-2012 20:01
>> To: user@nutch.apache.org
>> Subject: Why won't my crawl ignore these urls?
>>
>> Hi all,
>>
>> I have been trying to get to the bottom of this problem for ages and cannot
>> resolve it - you're my last hope, Obi-Wan...
>>
>> I have a job that crawls over a client's site. I want to exclude urls that
>> look like this:
>>
>> http://[clientsite.net]/resources/type.aspx?type=[whatever]
>>
>> and
>>
>> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
>>
>> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
>>
>> [...]
>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
>> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
>> [...]
>>
>> Yet when I next run the crawl I see things like this:
>>
>> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
>> [...]
>> fetching http://[clientsite.net]/resources/type.aspx?type=2
>> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
>> [...]
>>
>> and the corresponding pages seem to appear in the final Solr index. So
>> clearly they are not being excluded.
>>
>> Is anyone able to explain what I have missed? Any guidance much appreciated.
>>
>> Thanks,
>>
>>
>> Ian.
>> --
>> Dr Ian Piper
>> Tellura Information Services - the web, document and information people
>> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
>> http://www.tellura.co.uk/
>> Creator of monickr: http://monickr.com
>> 01926 813736 | 07973 156616
>> --
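
One way to rule out the "your regex is wrong" possibility is to test the two exclusion rules quoted above outside of Nutch. The sketch below is a rough stand-in, not Nutch's own code. It assumes three things:

- rules are tried top to bottom and the first one that matches decides;
- a pattern matches if it is found anywhere in the URL;
- a URL that matches no rule is rejected.

The trailing `+.` catch-all and the sample URLs are also assumptions, since the real host is anonymised as [clientsite.net] in the message.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First rule whose pattern is found in the URL decides; no match rejects."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# The two exclusion rules from the message, followed by an assumed catch-all.
RULES = r"""
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
+.
"""

if __name__ == "__main__":
    rules = parse_rules(RULES)
    # Hypothetical URLs: the second one deliberately omits the "www." prefix.
    for url in (
        "http://www.elaweb.org.uk/resources/type.aspx?type=2",
        "http://elaweb.org.uk/resources/type.aspx?type=2",
    ):
        print(url, "->", "accepted" if accepts(rules, url) else "rejected")
```

Under these assumptions the rules reject the `www.` form of the URL but let the bare-host form through, because the pattern requires a literal `www` before `elaweb`. Rule order matters as well: if a `+.` line appears above the two `-` lines, it matches first and the exclusions never apply.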