Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Marcin Okraszewski
Your regexps are incorrect. Both of them require the address to end with a dot. Try these instead:

+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.

Note that I moved the IP pattern first, because the deny pattern matches IP addresses as well. You should try your patterns outside Nutch first.
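One way to try the three rules outside Nutch is a small script like the sketch below. It is only an approximation: it assumes the filter walks the rules top to bottom, applies the first rule whose pattern is found anywhere in the URL, and rejects a URL that no rule matches. The helper name `accepts` and the sample URLs are made up for illustration, and Python's `re` stands in for the Java regex engine Nutch uses (for these simple patterns the two should behave the same).

```python
import re

# Rules in the order suggested above: '+' accepts the URL, '-' rejects it.
RULES = [
    ("+", r"^http://([0-9]{1,3}\.){3}[0-9]{1,3}"),  # hosts given as IP addresses
    ("-", r"^http://([a-z0-9]+\.)+[a-z0-9]+"),      # any host name containing a dot
    ("+", r"."),                                     # everything else
]


def accepts(url, rules=RULES):
    """Return True if the first rule found in `url` is a '+' rule.

    Assumption: first matching rule wins, and a URL that matches
    no rule at all is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Hypothetical sample URLs, not taken from the thread.
    samples = [
        "http://192.168.1.10/index.html",
        "http://www.example.com/",
        "http://intranet/wiki",
    ]
    for url in samples:
        print(url, "->", "accept" if accepts(url) else "reject")

    # With the deny rule first, the IP address is rejected as well,
    # which is why the IP pattern has to come first.
    swapped = [RULES[1], RULES[0], RULES[2]]
    print("deny rule first:", accepts("http://192.168.1.10/index.html", swapped))
```

Running it with the rules in the suggested order accepts the IP address and the dot-free host name and rejects the dotted host name; swapping the first two rules rejects the IP address too.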

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Naess, Ronny
If you are unsure of your regex, there is a regex applet at http://jakarta.apache.org/regexp/applet.html that you might want to try. Also, I do all my filtering in crawl-urlfilter.txt. I guess you must too, unless you have configured crawl-tool.xml to use your other file. property

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Manoharam Reddy
I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. Instead I am calling bin/nutch generate, fetch, update, etc. from a script. In that case I should be configuring regex-urlfilter.txt instead of crawl-urlfilter.txt. Am I right? On 5/30/07, Naess, Ronny

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Naess, Ronny
OK. I think the crawl tool is for intranet fetching, and I thought that was what you wanted? http://lucene.apache.org/nutch/tutorial8.html Anyway, I have no experience running Nutch without the crawl tool, so I suppose someone else will have to help you. -Original message- From: Manoharam Reddy

Re: [Nutch-general] I don't want to crawl internet sites

2007-05-30 Thread Doğacan Güney
On 5/30/07, Manoharam Reddy wrote: I am not configuring crawl-urlfilter.txt because I am not using the bin/nutch crawl tool. Instead I am calling bin/nutch generate, fetch, update, etc. from a script. In that case I should be configuring regex-urlfilter.txt instead of