In my crawl-urlfilter.txt I have added a rule like

-^http://cdserver

Yet when I run the crawl, it still fetches this site. I am running the
crawl with these commands:

bin/nutch inject crawl/crawldb urls

Then, inside a loop:

bin/nutch generate crawl/crawldb crawl/segments -topN 10
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment -threads 10
bin/nutch updatedb crawl/crawldb $segment

Why does it fetch http://cdserver even though I have blocked it? Is it
being "allowed" by some other filter file? If so, what do I need to
check? Please help.

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general