OK. I think the crawl tool is meant for intranet fetching, and I thought
that was what you wanted?

http://lucene.apache.org/nutch/tutorial8.html

Anyway, I have no experience running Nutch without the crawl tool, so I
suppose someone else will have to help you.



-----Original Message-----
From: Manoharam Reddy [mailto:[EMAIL PROTECTED] 
Sent: 30 May 2007 14:58
To: [EMAIL PROTECTED]
Subject: Re: I don't want to crawl internet sites

I am not configuring crawl-urlfilter.txt because I am not using the
"bin/nutch crawl" tool. Instead I am calling "bin/nutch generate",
fetch, update, etc. from a script.

In that case I should be configuring regex-urlfilter.txt instead of
crawl-urlfilter.txt. Am I right?

On 5/30/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> If you are unsure of your regex, you might want to try this regex
> applet:
>
> http://jakarta.apache.org/regexp/applet.html
>
> Also, I do all my filtering in crawl-urlfilter.txt. I guess you must
> too, unless you have configured crawl-tool.xml to use your other
> file.
>
> <property>
>   <name>urlfilter.regex.file</name>
>   <value>crawl-urlfilter.txt</value>
> </property>
>
> -Ronny
>
>
>
> -----Original Message-----
> From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
> Sent: 30 May 2007 13:42
> To: [EMAIL PROTECTED]
> Subject: I don't want to crawl internet sites
>
> This is my regex-urlfilter.txt file.
>
>
> -^http://([a-z0-9]*\.)+
> +^http://([0-9]+\.)+
> +.
>
> I want to allow only IP addresses and internal sites to be crawled and
> fetched. This means:
>
> http://www.google.com should be ignored.
> http://shoppingcenter should be crawled.
> http://192.168.101.5 should be crawled.
>
> But when I see the logs, I find that http://someone.blogspot.com/ has 
> also been crawled. How is it possible?
>
> Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, 
> in what order are the filters called?
>
> 
>
>
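
To check the three rules quoted above by hand, here is a rough Python
sketch. It assumes the regex URL filter applies the first rule whose
pattern is found anywhere in the URL and ignores a URL that no rule
matches, and that Python's re module behaves like Java's regex engine
for these simple patterns. The names RULES, REORDERED and accepts are
made up for this illustration and are not part of Nutch. REORDERED is
only one possible fix, not a tested configuration.

```python
import re

# Rules copied from the regex-urlfilter.txt in the original message.
RULES = [
    ("-", r"^http://([a-z0-9]*\.)+"),
    ("+", r"^http://([0-9]+\.)+"),
    ("+", r"."),
]

# Hypothetical alternative: test for a purely numeric host first, so the
# broader "anything containing a dot" exclusion cannot shadow it.
REORDERED = [
    ("+", r"^http://([0-9]+\.)+[0-9]+(/|:|$)"),
    ("-", r"^http://([a-z0-9]*\.)+"),
    ("+", r"."),
]


def accepts(url, rules):
    """Return True if the first rule found in the URL is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumed behaviour: no matching rule means "ignore"


if __name__ == "__main__":
    urls = [
        "http://www.google.com",
        "http://shoppingcenter",
        "http://192.168.101.5",
        "http://someone.blogspot.com/",
    ]
    for url in urls:
        print(url, accepts(url, RULES), accepts(url, REORDERED))
```

Under these assumptions the original rule order rejects
http://192.168.101.5, because the first rule already matches "192.", and
it also rejects http://someone.blogspot.com/. If that URL was fetched
anyway, it may be that this file was not the filter in effect for that
step, which would fit the crawl-urlfilter.txt versus regex-urlfilter.txt
question discussed in this thread.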
