Your regexps are incorrect. Both of them require the address to end
with a dot. Try these instead:
+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.
Note that I moved the IP pattern first, because the deny pattern matches
IP addresses as well. You should try your patterns outside Nutch first.
If you are unsure of your regex, you might want to try this regex applet:
http://jakarta.apache.org/regexp/applet.html
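
For a quick check outside Nutch, something like the sketch below may help. It is
only an approximation written in Python, not Nutch code: it assumes the filter
uses the first rule whose pattern is found anywhere in the URL, and that a URL
matching no rule is ignored. Nutch itself uses Java regular expressions, which
behave the same way for these simple patterns. The function name and the sample
URLs are made up for illustration.

```python
import re

# Rules copied from the message above, in file order.
# Each entry is (accept, pattern): '+' lines accept, '-' lines reject.
RULES = [
    (True, r"^http://([0-9]{1,3}\.){3}[0-9]{1,3}"),
    (False, r"^http://([a-z0-9]+\.)+[a-z0-9]+"),
    (True, r"."),
]


def first_match(url, rules=RULES):
    """Return the accept flag of the first rule found in url.

    Returns None when no rule matches (assumed here to mean 'ignored').
    """
    for accept, pattern in rules:
        if re.search(pattern, url):
            return accept
    return None


if __name__ == "__main__":
    # Hypothetical sample URLs, not taken from the original thread.
    samples = [
        "http://192.168.1.10/index.html",
        "http://www.example.com/",
        "http://localhost/",
    ]
    for url in samples:
        print(url, first_match(url))
```

Swapping the first two rules in RULES shows why the order matters: the deny
pattern then catches the IP address URL before the IP pattern is ever tried.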
Also, I do all my filtering in crawl-urlfilter.txt.
I guess you must do the same, unless you have configured crawl-tool.xml
to use your other file.
property
I am not configuring crawl-urlfilter.txt because I am not using the
bin/nutch crawl tool. Instead I am calling bin/nutch generate,
fetch, update, etc. from a script.
In that case I should be configuring regex-urlfilter.txt instead of
crawl-urlfilter.txt. Am I right?
On 5/30/07, Naess, Ronny wrote:
OK. I think the crawl tool is for intranet fetching, and I thought that
was what you wanted?
http://lucene.apache.org/nutch/tutorial8.html
Anyway, I have no experience running Nutch without the crawl tool, so I
suppose someone else will have to help you.
-Original Message-
From: Manoharam Reddy
On 5/30/07, Manoharam Reddy [EMAIL PROTECTED] wrote:
I am not configuring crawl-urlfilter.txt because I am not using the
bin/nutch crawl tool. Instead I am calling bin/nutch generate,
fetch, update, etc. from a script.
In that case I should be configuring regex-urlfilter.txt instead of