Hi Ivannie,

This is what I did:

1. Started from the nutch-0.9 release.
2. Applied the NUTCH-503 patch (fixes a generator bug that causes it to fail if the first segment is empty but subsequent ones are not) and the NUTCH-467 patch (fixes a dedup failure if the index directory is empty).
3. Recompiled.
4. Ran the following configurations: 1, 2, 3, 4, and 5 slave nodes, each with 10 map threads and 10 reduce threads.

So I don't think it's an issue with the number of map/reduce threads--I've also had it working with 5 threads for map and other random small prime numbers.

Where did your crawl fail?

Jiaqi

P.S. I'm also no nutch developer, just a user.

On Fri, Feb 22, 2008 at 10:56 PM, Ivannie <[EMAIL PROTECTED]> wrote:
> hi, Jiaqi Tan & John Mendenhall
>
> i have encountered the same problem, i have tried
> correct the log4j bug and
> http://www.mail-archive.com/[EMAIL PROTECTED]/msg01991.html
> already, and it still did not work, i was working on a cluster of 4 boxes
> with redhat as4
>
> i also checked the hadoop.log and found nothing more important
>
> so i think the problem was the generator, and i saw someone said it might
> caused by setting bad mapred.map.tasks and mapred.reduce.tasks, i had 4 PCs
> and followed the explanation of mapred.map.tasks and mapred.reduce.tasks, i
> set 17 and 7, was it right? can someone help me?
>
> thanks
>
> ivannie
>
> >> 08/02/20 15:38:09 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
> >> 08/02/20 15:38:09 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
> >> 08/02/20 15:38:09 WARN crawl.Crawl: No URLs to fetch - check your seed list and URL filters.
> >>
> >> I've inserted code at Generator.java:424, which says:
> >> if (readers == null || readers.length == 0 || !readers[0].next(new FloatWritable())) {
> >>   LOG.warn("Generator: 0 records selected for fetching, exiting ...");
> >>
> >> essentially at the decision point to see which of the conditions
> >> triggered the 0 records selected message, and the "readers" object is
> >> perfectly fine, but the SequenceFileOutputFormat is reporting there
> >> are no values (I suppose of URL scores) at all to be retrieved,
> >> causing the generator to stop.
> >
> >There is a problem with the Generator. There was a change committed
> >after 0.9 was released.
> >I implemented this change and it fixed my problem:
> >
> >http://www.mail-archive.com/[EMAIL PROTECTED]/msg01991.html
> >
> >JohnM
> >
> >--
> >john mendenhall
> >[EMAIL PROTECTED]
> >surf utopia
> >internet services
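
For anyone repeating the diagnosis quoted above, the compound condition can be split so the log shows which branch produced the "0 records selected" message. This is only a rough sketch, not the actual Nutch code: `GeneratorCheck`, `ScoreReader`, and `emptyReason` are hypothetical names, and `ScoreReader` is a stand-in for whatever reader type Generator.java uses (a real reader would likely throw IOException, which is left out here to keep the example self-contained). Like the quoted condition, it only inspects the first reader.

```java
public class GeneratorCheck {

    /** Hypothetical stand-in for the reader type used in Generator.java. */
    interface ScoreReader {
        /** Returns true if a record was read, false if none are left. */
        boolean next();
    }

    /**
     * Evaluates the three parts of the quoted condition one at a time.
     * Returns a short reason if the generator would stop, or null if
     * the first reader yields at least one record.
     */
    static String emptyReason(ScoreReader[] readers) {
        if (readers == null) {
            return "readers is null";
        }
        if (readers.length == 0) {
            return "readers is empty";
        }
        // Same limitation as the quoted check: only readers[0] is consulted.
        if (!readers[0].next()) {
            return "first reader has no records";
        }
        return null;
    }

    public static void main(String[] args) {
        ScoreReader noRecords = () -> false; // simulates an empty first output
        String reason = emptyReason(new ScoreReader[] { noRecords });
        System.out.println("Generator would stop: " + reason);
    }
}
```

In the real class, each returned reason could be passed to LOG.warn in place of the single generic message, which makes it easier to report where the crawl failed.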