I am not configuring crawl-urlfilter.txt because I am not using the
"bin/nutch crawl" tool. Instead, I am calling "bin/nutch generate",
fetch, updatedb, etc. from a script.

In that case I should be configuring regex-urlfilter.txt instead of
crawl-urlfilter.txt. Am I right?
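
For reference, my script just runs the usual whole-web cycle, something
like this (directory names and the number of rounds below are only
placeholders):

bin/nutch inject crawldb urls
for i in 1 2 3
do
  bin/nutch generate crawldb segments
  segment=`ls -d segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawldb $segment
done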

On 5/30/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
If you are unsure of your regex you might want to try this regex applet

http://jakarta.apache.org/regexp/applet.html

Also, I do all my filtering in crawl-urlfilter.txt. I guess you must be
doing the same, unless you have configured crawl-tool.xml to point at
your other file:

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>
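
Note that crawl-tool.xml is only read by the "bin/nutch crawl" command;
the individual tools read conf/nutch-default.xml and conf/nutch-site.xml,
where urlfilter.regex.file defaults to regex-urlfilter.txt. If you want
the script-driven tools to use the same file, one option (just a sketch,
assuming a stock install) is to override the property in
conf/nutch-site.xml:

<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>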

-Ronny



-----Original Message-----
From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
Sent: 30 May 2007 13:42
To: [email protected]
Subject: I don't want to crawl internet sites

This is my regex-urlfilter.txt file.

-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.

I want to allow only IP addresses and internal sites to be crawled and
fetched. This means:

http://www.google.com should be ignored
http://shoppingcenter should be crawled
http://192.168.101.5 should be crawled
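
As a rough sanity check of the patterns themselves (grep -E is only an
approximation here; Nutch's regex filter applies the first matching
rule, top to bottom):

# matches the exclude rule, so I would expect it to be dropped
echo "http://someone.blogspot.com/" | grep -E '^http://([a-z0-9]*\.)+'

# this also matches the exclude rule, since [a-z0-9] includes digits,
# which is not what I intended
echo "http://192.168.101.5" | grep -E '^http://([a-z0-9]*\.)+'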

But when I look at the logs, I find that http://someone.blogspot.com/
has also been crawled. How is that possible?

Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, in
what order are the filters called?
