In my crawl-urlfilter.txt I have added the following rule:

    -^http://cdserver

Yet when I run the crawl, Nutch still fetches this site. I run the crawl with these commands:

    bin/nutch inject crawl/crawldb urls

and then, inside a loop:

    bin/nutch generate crawl/crawldb crawl/segments -topN 10
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment -threads 10
    bin/nutch updatedb crawl/crawldb $segment

Why does it fetch http://cdserver even though I have blocked it? Is the URL being allowed by some other filter file? If so, what do I need to check? Please help.
