Your regexps are incorrect. Both of them require the address to end
with a dot. Try these rules instead:
+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.
Note, I moved the IP pattern first, because the deny pattern matches
IP addresses as well. You should test your patterns outside Nutch first.
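For example, here is a minimal harness in plain java.util.regex (just a
sketch: the class name and rule table are made up for illustration, and
it assumes the same in-order, first-match-wins, unanchored find()
evaluation that Nutch's RegexURLFilter uses):

    import java.util.regex.Pattern;

    // Sanity-check regex-urlfilter.txt rules outside Nutch.
    public class UrlFilterTest {
        // Rules in file order: sign, then pattern.
        static final String[][] RULES = {
            {"+", "^http://([0-9]{1,3}\\.){3}[0-9]{1,3}"},
            {"-", "^http://([a-z0-9]+\\.)+[a-z0-9]+"},
            {"+", "."},
        };

        // First matching rule decides; no match means reject.
        static boolean accepted(String url) {
            for (String[] rule : RULES) {
                if (Pattern.compile(rule[1]).matcher(url).find()) {
                    return rule[0].equals("+");
                }
            }
            return false;
        }

        public static void main(String[] args) {
            String[] urls = {
                "http://www.google.com",        // denied by the host pattern
                "http://shoppingcenter",        // no dot, falls through to "+."
                "http://192.168.101.5",         // allowed by the IP pattern
                "http://someone.blogspot.com/", // denied by the host pattern
            };
            for (String url : urls) {
                System.out.println((accepted(url) ? "+ " : "- ") + url);
            }
        }
    }

Running it should print "+" only for http://shoppingcenter and
http://192.168.101.5.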
Cheers,
Marcin
On 5/30/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
This is my regex-urlfilter.txt file.
-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.
I want to allow only IP addresses and internal sites to be crawled and
fetched. This means:
http://www.google.com should be ignored
http://shoppingcenter should be crawled
http://192.168.101.5 should be crawled.
But when I check the logs, I find that http://someone.blogspot.com/ has
also been crawled. How is this possible?
Is my regex-urlfilter.txt wrong? Are there other URL filters? If so,
in what order are the filters called?