RE: Why won't my crawl ignore these urls?

Markus Jelsma Mon, 30 Jul 2012 11:05:57 -0700

Hi,

Either your regex is wrong, you haven't updated the CrawlDB with the new 
filters, or you disabled filtering in the Generator. More than one of these 
may apply.
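
To rule out the first possibility, it can help to replay the rules outside 
Nutch. The sketch below is only an approximation in Python, not Nutch code. 
It assumes the regex filter reads rules top to bottom, lets the first 
matching rule decide, and rejects a URL that matches no rule. It also assumes 
Python's re module treats these particular patterns the same way Java's regex 
engine does. The trailing "+." catch-all, the load_rules and accepts helpers, 
and the test URLs are made up for illustration.

```python
import re


def load_rules(text):
    """Parse regex-urlfilter.txt style lines into (accept, regex) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule that matches anywhere in the URL."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False  # assumption: a URL matching no rule is rejected


# The two exclusion rules from the original message, followed by an
# assumed catch-all that accepts everything else.
RULES = r"""
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
+.
"""

# Same rules, but with the catch-all listed first (hypothetical mistake).
RULES_CATCH_ALL_FIRST = r"""
+.
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
-^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
"""

if __name__ == "__main__":
    rules = load_rules(RULES)
    for url in (
        "http://www.elaweb.org.uk/resources/topic.aspx?topic=10",
        "http://elaweb.org.uk/resources/topic.aspx?topic=10",
    ):
        print(url, "->", "accepted" if accepts(rules, url) else "rejected")
```

With these rules the www URL is rejected, but the same path on a host without 
the "www" prefix falls through to the catch-all and is accepted, and so is 
everything once "+." is listed first. If your rules behave as expected in a 
check like this, the CrawlDB and Generator settings are the next things to 
look at.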


Cheers

 
 
-----Original message-----
> From:Ian Piper <ianpi...@tellura.co.uk>
> Sent: Mon 30-Jul-2012 20:01
> To: user@nutch.apache.org
> Subject: Why won't my crawl ignore these urls?
> 
> Hi all,
> 
> I have been trying to get to the bottom of this problem for ages and cannot 
> resolve it - you're my last hope, Obi-Wan...
> 
> I have a job that crawls over a client's site. I want to exclude urls that 
> look like this:
> 
> http://[clientsite.net]/resources/type.aspx?type=[whatever]
> 
> and
> 
> http://[clientsite.net]/resources/topic.aspx?topic=[whatever]
> 
> 
> To achieve this I thought I could put this into conf/regex-urlfilter.txt:
> 
> [...]
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/type.aspx.*
> -^http://([a-z0-9\-A-Z]*\.)*www.elaweb.org.uk/resources/topic.aspx.*
> [...]
> 
> Yet when I next run the crawl I see things like this:
> 
> fetching http://[clientsite.net]/resources/topic.aspx?topic=10
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=37
> [...]
> fetching http://[clientsite.net]/resources/type.aspx?type=2
> -activeThreads=10, spinWaiting=9, fetchQueues.totalSize=36
> [...]
> 
> and the corresponding pages seem to appear in the final Solr index. So 
> clearly they are not being excluded.
> 
> Is anyone able to explain what I have missed? Any guidance much appreciated.
> 
> Thanks,
> 
> 
> Ian.
> -- 
> Dr Ian Piper
> Tellura Information Services - the web, document and information people
> Registered in England and Wales: 5076715, VAT Number: 874 2060 29
> http://www.tellura.co.uk/
> Creator of monickr: http://monickr.com
> 01926 813736 | 07973 156616
> -- 
> 
> 
>
