On 5/30/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> I am not configuring crawl-urlfilter.txt because I am not using the
> "bin/nutch crawl" tool. Instead I am calling "bin/nutch generate",
> fetch, update, etc. from a script.
>
> In that case I should be configuring regex-urlfilter.txt instead of
> crawl-urlfilter.txt. Am I right?

Yes, if you are not using the 'crawl' command, you should edit regex-urlfilter.txt.
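
On the rules quoted further down: as far as I understand the regex URL filter, rules are tried from top to bottom, the first rule whose pattern is found anywhere in the URL decides (+ accepts, - rejects), and a URL that matches no rule is rejected. If that is right, the first rule in the quoted file also matches IP addresses, because digits are part of [a-z0-9], so the second rule is never reached. Here is a small Python sketch that imitates this behaviour so the rules can be tried outside Nutch. It is not Nutch code: load_rules, accepts and the REORDERED rule set are names and patterns I made up for illustration, and the first-match, partial-match semantics are an assumption.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: first rule found in the URL decides; no match rejects."""
    for allow, pattern in rules:
        if pattern.search(url):  # partial match, not a full match
            return allow
    return False

# The rules exactly as posted in the quoted message.
ORIGINAL = r"""
-^http://([a-z0-9]*\.)+
+^http://([0-9]+\.)+
+.
"""

# A hypothetical reordering: accept dotted IPv4 hosts first,
# then reject any host that contains a dot, then accept the rest.
REORDERED = r"""
+^http://([0-9]+\.){3}[0-9]+(:[0-9]+)?(/|$)
-^http://[^/]*\.
+.
"""

if __name__ == "__main__":
    urls = [
        "http://www.google.com",
        "http://shoppingcenter",
        "http://192.168.101.5",
        "http://someone.blogspot.com/",
    ]
    for name, text in (("original", ORIGINAL), ("reordered", REORDERED)):
        rules = load_rules(text)
        for url in urls:
            print(name, url, "accept" if accepts(rules, url) else "reject")
```

Under these assumptions the posted rules already reject http://someone.blogspot.com/, so if that URL was fetched anyway, the more likely explanation is that this file was not the one being read, not that the pattern let it through.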

>
> On 5/30/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
> > If you are unsure of your regex, you might want to try this regex applet:
> >
> > http://jakarta.apache.org/regexp/applet.html
> >
> > Also, I do all my filtering in crawl-urlfilter.txt.
> > I guess you must too, unless you have configured crawl-tool.xml to use
> > your other file.
> >
> > <property>
> >   <name>urlfilter.regex.file</name>
> >   <value>crawl-urlfilter.txt</value>
> > </property>
> >
> > -Ronny
> >
> >
> >
> > -----Original Message-----
> > From: Manoharam Reddy [mailto:[EMAIL PROTECTED]
> > Sent: 30 May 2007 13:42
> > To: [EMAIL PROTECTED]
> > Subject: I don't want to crawl internet sites
> >
> > This is my regex-urlfilter.txt file.
> >
> >
> > -^http://([a-z0-9]*\.)+
> > +^http://([0-9]+\.)+
> > +.
> >
> > I want to allow only IP addresses and internal sites to be crawled and
> > fetched. This means:-
> >
> > http://www.google.com should be ignored
> > http://shoppingcenter should be crawled
> > http://192.168.101.5 should be crawled.
> >
> > But when I see the logs, I find that http://someone.blogspot.com/ has
> > also been crawled. How is it possible?
> >
> > Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, in
> > what order are the filters called?
> >
>


-- 
Doğacan Güney
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
