Thanks! It worked.
On 5/28/07, Doğacan Güney <[EMAIL PROTECTED]> wrote:
Hi,
On 5/28/07, Manoharam Reddy <[EMAIL PROTECTED]> wrote:
> In my crawl-urlfilter.txt I have put a statement like
>
> -^http://cdserver
>
> Still, while running the crawl, it fetches this site. I am running the
> crawl using these commands:
>
> bin/nutch inject crawl/crawldb urls
>
> Inside a loop:
>
> bin/nutch generate crawl/crawldb crawl/segments -topN 10
> segment=`ls -d crawl/segments/* | tail -1`
> bin/nutch fetch $segment -threads 10
> bin/nutch updatedb crawl/crawldb $segment
>
> Why does it fetch http://cdserver even though I have blocked it? Is it
> being "allowed" by some other filter file? If so, what do I need
> to check? Please help.
>
In your case, crawl-urlfilter.txt is not read because you are not
running the 'crawl' command (as in bin/nutch crawl); only the one-shot
crawl tool reads that file. When you run the individual steps, you have
to put your rules in regex-urlfilter.txt or prefix-urlfilter.txt
instead, and make sure the matching urlfilter plugin is enabled in your
conf.
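
A minimal sketch of what that looks like, assuming the stock Nutch
conf/ layout (the plugin.includes value below is only illustrative;
keep whatever plugins your setup already lists):

  In conf/regex-urlfilter.txt, put the exclude rule above the
  catch-all accept rule, since the first matching rule wins:

    # block the cdserver host
    -^http://cdserver
    # accept anything else
    +.

  In conf/nutch-site.xml, check that urlfilter-regex appears in
  plugin.includes:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>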
--
Doğacan Güney