Thanks! It worked.
On 5/28/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
> Hi,
>
> On 5/28/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> > In my crawl-urlfilter.txt I have put a statement like
> >
> > -^http://cdserver
> >
> > Still, while running the crawl, it fetches this site. I am running the
> > crawl using these commands:
> >
> > bin/nutch inject crawl/crawldb urls
> >
> > Inside a loop:
> >
> > bin/nutch generate crawl/crawldb crawl/segments -topN 10
> > segment=`ls -d crawl/segments/* | tail -1`
> > bin/nutch fetch $segment -threads 10
> > bin/nutch updatedb crawl/crawldb $segment
> >
> > Why does it fetch http://cdserver even though I have blocked it? Is it
> > becoming "allowed" by some other filter file? If so, what do I need
> > to check? Please help.
>
> In your case, crawl-urlfilter.txt is not read because you are not
> running the 'crawl' command (as in bin/nutch crawl). You have to update
> regex-urlfilter.txt or prefix-urlfilter.txt and make sure that you
> enable them in your conf.
>
> --
> Doğacan Güney
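For anyone hitting the same problem when running the step-by-step commands, a minimal sketch of the fix the reply describes: put the block rule in regex-urlfilter.txt (rules are applied in order; the first matching rule wins), and make sure the urlfilter-regex plugin is enabled. The exact plugin list below is an assumption and should be adapted to your own setup.

conf/regex-urlfilter.txt:

    # block the internal CD server (first match wins)
    -^http://cdserver

    # accept everything else
    +.

conf/nutch-site.xml (the value shown is an illustrative plugin list, not a prescribed one; the key point is that urlfilter-regex appears in it):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>

Note that the exclude rule must come before any catch-all `+` rule, otherwise the URL is accepted before the block rule is ever consulted.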