We had the same problem last week. When we called the Nutch commands individually (inject, generate, ...), everything worked, but with the crawl command the generate stage failed. crawl-tool.xml and nutch-site.xml were identical except for the default URL filter: nutch-site.xml used regex-urlfilter while crawl-tool.xml used crawl-urlfilter. Although those two filter files were identical in our setup, we had to replace crawl-urlfilter with regex-urlfilter in crawl-tool.xml in order for the crawl command to work.
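For reference, the change amounts to swapping the filter plugin named in crawl-tool.xml's plugin.includes property. This is only a sketch: the other plugins listed in the value will differ in your setup, and the surrounding list shown here is illustrative, not the stock file.

```xml
<!-- conf/crawl-tool.xml (sketch): replace urlfilter-crawl with
     urlfilter-regex in plugin.includes; the other plugin names
     shown are illustrative placeholders for your own list -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic</value>
</property>
```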
I know, it's weird. I even think that changing regex-filter to crawl-filter in nutch-site.xml causes the inject and generate commands to fail.

Hope this will help you.

Sent from my iPhone

On 2011-03-06, at 22:19, chidu r <[email protected]> wrote:

> Hi all,
>
> I am trying to set up Nutch 1.2 on Hadoop and used the instructions at
> http://wiki.apache.org/nutch/NutchHadoopTutorial; it has been very useful.
>
> However, I find that when I execute the command:
>
> $ bin/nutch crawl urls -dir crawl -depth 4 -topN 50
>
> the crawler stops at the generator stage with the message:
>
> 2011-03-06 17:23:49,538 WARN crawl.Generator - Generator: 0 records
> selected for fetching, exiting ...
>
> I have configured the following plugins in nutch-site.xml:
> protocol-http|parse-(text|html|js)|urlnormalizer-(pass|regex|basic)|urlfilter-regex|index-(basic|anchor)
>
> I am not using crawl-urlfilter.txt or regex-urlfilter.txt to filter URLs. I
> initiated the crawl with 10 seed URLs from popular sites on the internet.
>
> Any pointers to what I am missing here?
>
> regards,
> Chidu
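One more note on the quoted question: with the urlfilter-regex plugin enabled, the filter rules still come from a regex-urlfilter.txt file, so "not using" that file usually means the default one shipped in conf/ is applied, and an overly strict rule set there can reject every seed URL and produce the "0 records selected for fetching" message. A minimal, permissive rule file looks something like this (a sketch, not a recommendation for production crawls):

```
# conf/regex-urlfilter.txt (sketch)
# lines starting with '-' reject matching URLs, '+' accepts them;
# the first matching rule wins, so order matters

# skip non-web schemes
-^(file|ftp|mailto):

# accept everything else
+.
```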

