Okay. I think the crawl tool is for intranet fetching, and I thought that was what you wanted?
http://lucene.apache.org/nutch/tutorial8.html

Anyway, I have no experience running Nutch without the crawl tool, so I suppose someone else will have to help you.

-----Original message-----
From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
Sent: 30 May 2007 14:58
To: [EMAIL PROTECTED]
Subject: Re: I don't want to crawl internet sites

I am not configuring crawl-urlfilter.txt because I am not using the
"bin/nutch crawl" tool. Instead I am calling "bin/nutch generate",
fetch, update, etc. from a script.

In that case I should be configuring regex-urlfilter.txt instead of
crawl-urlfilter.txt. Am I right?

On 5/30/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> If you are unsure of your regex you might want to try this regex
> applet
>
> http://jakarta.apache.org/regexp/applet.html
>
> Also, I do all my filtering in crawl-urlfilter.txt. I guess you must
> also, unless you have configured crawl-tool.xml to use your other
> file.
>
> <property>
>   <name>urlfilter.regex.file</name>
>   <value>crawl-urlfilter.txt</value>
> </property>
>
> -Ronny
>
> -----Original message-----
> From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
> Sent: 30 May 2007 13:42
> To: [EMAIL PROTECTED]
> Subject: I don't want to crawl internet sites
>
> This is my regex-urlfilter.txt file.
>
> -^http://([a-z0-9]*\.)+
> +^http://([0-9]+\.)+
> +.
>
> I want to allow only IP addresses and internal sites to be crawled and
> fetched. This means:
>
> http://www.google.com should be ignored
> http://shoppingcenter should be crawled
> http://192.168.101.5 should be crawled
>
> But when I look at the logs, I find that http://someone.blogspot.com/
> has also been crawled. How is that possible?
>
> Is my regex-urlfilter.txt wrong? Are there other URL filters? If so,
> in what order are the filters called?
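
To check how the three rules in the quoted regex-urlfilter.txt behave, here is a minimal Python sketch. It assumes the regex URL filter tries the rules in file order, lets the first matching rule decide, and rejects a URL that no rule matches. That is my understanding of how Nutch applies the file, not something confirmed in this thread. The helper names (parse_rules, accepts) are hypothetical, and Python's re module stands in for Java's regex engine, which should agree on patterns this simple.

```python
import re

# Rules copied from the quoted regex-urlfilter.txt, in file order.
RULES_TEXT = r"""
-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.
"""

# Hypothetical variant: same rules, with the IP-address rule moved first.
REORDERED_TEXT = r"""
+^http://([0-9]+\.)+
-^http://([a-z0-9]*\.)+
+.
"""


def parse_rules(text):
    """Turn filter-file text into a list of (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found in the URL decides."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumed default: no rule matched, so reject


if __name__ == "__main__":
    urls = [
        "http://www.google.com",
        "http://shoppingcenter",
        "http://192.168.101.5",
        "http://someone.blogspot.com/",
    ]
    for label, text in (("as posted", RULES_TEXT), ("reordered", REORDERED_TEXT)):
        rules = parse_rules(text)
        for url in urls:
            verdict = "accept" if accepts(rules, url) else "reject"
            print(label, url, verdict)
```

Under those assumptions, the file as posted rejects http://192.168.101.5, because [a-z0-9] also matches digits and the exclude rule comes first. It rejects http://someone.blogspot.com/ as well, so if that page was fetched anyway, it may be that this file was not the one in effect, which fits Ronny's point about urlfilter.regex.file. Moving the IP rule to the top is one possible fix, though a host name that starts with a numeric label would then be accepted too.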
_______________________________________________
Nutch-general mailing list
Nutch-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-general