Your regexps are incorrect. Both of them require the address to end
with a dot. Try these rules instead:

+^http://([0-9]{1,3}\.){3}[0-9]{1,3}
-^http://([a-z0-9]+\.)+[a-z0-9]+
+.

Note that I moved the IP pattern first, because the deny pattern
matches IP addresses as well. You should try your patterns outside
Nutch first.
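For testing outside Nutch, a quick standalone class is enough. Here is a minimal sketch that mimics the filter's first-match-wins evaluation with the rules above (the class name and the assumption of java.util.regex find() semantics are mine, not from Nutch itself):

```java
import java.util.regex.Pattern;

public class UrlFilterTest {
    // Rules in the order suggested above: accept IPs first,
    // then deny dotted hostnames, then accept everything else.
    static final String[] RULES = {
        "+^http://([0-9]{1,3}\\.){3}[0-9]{1,3}",
        "-^http://([a-z0-9]+\\.)+[a-z0-9]+",
        "+."
    };

    // First rule whose pattern is found in the URL decides:
    // '+' means accept, '-' means reject; no match means reject.
    static boolean accepts(String url) {
        for (String rule : RULES) {
            Pattern p = Pattern.compile(rule.substring(1));
            if (p.matcher(url).find()) {
                return rule.charAt(0) == '+';
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(accepts("http://192.168.101.5"));   // true
        System.out.println(accepts("http://shoppingcenter"));  // true
        System.out.println(accepts("http://www.google.com"));  // false
    }
}
```

With these rules the IP and the dotless internal hostname get through, while any dotted hostname is rejected before the catch-all "+." can accept it.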

Cheers,
Marcin

On 5/30/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> This is my regex-urlfilter.txt file.
>
>
> -^http://([a-z0-9]*\.)+
> +^http://([0-9]+\.)+
> +.
>
> I want to allow only IP addresses and internal sites to be crawled and
> fetched. This means:
>
> http://www.google.com should be ignored
> http://shoppingcenter should be crawled
> http://192.168.101.5 should be crawled.
>
> But when I see the logs, I find that http://someone.blogspot.com/ has
> also been crawled. How is it possible?
>
> Is my regex-urlfilter.txt wrong? Are there other URL filters? If so,
> in what order are the filters called?
>

_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general