Your regexps are incorrect. Both of them require the address to end with a dot. Try these rules instead:
+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.

Note, I moved the IP pattern first, because the deny pattern matches IP addresses as well (192.168.101.5 also fits ([a-z0-9]+\.)+[a-z0-9]+). You should try your patterns first, outside Nutch.

Cheers,
Marcin

On 5/30/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> This is my regex-urlfilter.txt file.
>
> -^http://([a-z0-9]*\.)+
> +^http://([0-9]+\.)+
> +.
>
> I want to allow only IP addresses and internal sites to be crawled and
> fetched. This means:
>
> http://www.google.com should be ignored
> http://shoppingcenter should be crawled
> http://192.168.101.5 should be crawled.
>
> But when I see the logs, I find that http://someone.blogspot.com/ has
> also been crawled. How is it possible?
>
> Is my regex-urlfilter.txt wrong? Are there other URL filters? If so,
> in what order are the filters called?

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general
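One way to try the patterns outside Nutch is a small script that mimics how regex-urlfilter.txt is evaluated: rules are tried top to bottom, the first rule whose pattern matches decides, '+' accepts and '-' rejects. This is a minimal Python sketch of that behavior (the rule list is the one suggested in the reply above; the `accepted` helper is illustrative, not part of Nutch):

```python
import re

# Rules in the suggested order: IP accept first, then hostname deny,
# then a catch-all accept. First matching rule decides; '+' accepts,
# '-' rejects, mirroring Nutch's regex-urlfilter semantics.
rules = [
    ('+', re.compile(r'^http://([0-9]{1,3}\.){3}[0-9]{1,3}')),
    ('-', re.compile(r'^http://([a-z0-9]+\.)+[a-z0-9]+')),
    ('+', re.compile(r'.')),
]

def accepted(url):
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == '+'
    return False  # no rule matched: Nutch rejects the URL

for url in ('http://www.google.com',
            'http://shoppingcenter',
            'http://192.168.101.5'):
    print(url, accepted(url))
# http://www.google.com False
# http://shoppingcenter True
# http://192.168.101.5 True
```

Swapping the first two rules makes 192.168.101.5 hit the deny pattern first, which is exactly why the IP rule has to come first.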