We had the same problem last week. When we called the Nutch commands individually (inject, generate, ...), everything worked, but with the crawl command the generate stage failed. crawl-tool.xml and nutch-site.xml were identical except for the default URL filter: nutch-site.xml used regex-urlfilter while crawl-tool.xml used crawl-urlfilter. Although those two filter files were identical in our setup, we had to replace crawl-urlfilter with regex-urlfilter in crawl-tool.xml in order for the crawl command to work.
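For reference, the change amounts to swapping the filter plugin named in crawl-tool.xml's plugin.includes property. This is only a sketch: the other plugins listed in the value will differ in your setup, and the surrounding list shown here is illustrative, not the stock file.

```xml
<!-- conf/crawl-tool.xml (sketch): replace urlfilter-crawl with
     urlfilter-regex in plugin.includes; the other plugin names
     shown are illustrative placeholders for your own list -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic</value>
</property>
```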
I know, it's weird. I even think that changing regex-filter to crawl-filter in nutch-site.xml causes the inject and generate commands to fail.

Hope this will help you.

Sent from my iPhone

On 2011-03-06, at 22:19, chidu r <[email protected]> wrote:

> Hi all,
>
> I am trying to set up Nutch 1.2 on Hadoop and used the instructions at
> http://wiki.apache.org/nutch/NutchHadoopTutorial; it has been very useful.
>
> However, I find that when I execute the command:
>
> $ bin/nutch crawl urls -dir crawl -depth 4 -topN 50
>
> the crawler stops at the generator stage with the message:
>
> 2011-03-06 17:23:49,538 WARN crawl.Generator - Generator: 0 records
> selected for fetching, exiting ...
>
> I have configured the following plugins in nutch-site.xml:
> protocol-http|parse-(text|html|js)|urlnormalizer-(pass|regex|basic)|urlfilter-regex|index-(basic|anchor)
>
> I am not using crawl-urlfilter.txt or regex-urlfilter.txt to filter URLs. I
> initiated the crawl with 10 seed URLs from popular sites on the internet.
>
> Any pointers to what I am missing here?
>
> regards,
> Chidu
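One more note on the quoted question: with the urlfilter-regex plugin enabled, the filter rules still come from a regex-urlfilter.txt file, so "not using" that file usually means the default one shipped in conf/ is applied, and an overly strict rule set there can reject every seed URL and produce the "0 records selected for fetching" message. A minimal, permissive rule file looks something like this (a sketch, not a recommendation for production crawls):

```
# conf/regex-urlfilter.txt (sketch)
# lines starting with '-' reject matching URLs, '+' accepts them;
# the first matching rule wins, so order matters

# skip non-web schemes
-^(file|ftp|mailto):

# accept everything else
+.
```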

