Hi Ivannie,

This is what I did:
1. nutch-0.9 release
2. applied the NUTCH-503 patch (fixes a Generator bug that causes it to
fail if the first segment is empty but subsequent ones are not) and the
NUTCH-467 patch (fixes a dedup failure if the index directory is empty)
3. recompiled
4. tried the following configurations:
1, 2, 3, 4, and 5 slave nodes, each with 10 map threads and 10 reduce threads

So I don't think it's an issue with the number of map/reduce
threads; I've also had it working with 5 map threads and with other
randomly chosen small prime numbers.
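
For reference, mapred.map.tasks and mapred.reduce.tasks normally go in
conf/hadoop-site.xml (overriding the defaults in hadoop-default.xml). A
rough sketch of what your 17/7 setting would look like there; the values
are just the ones you quoted:

<configuration>
  <!-- default number of map tasks per job (a per-job total,
       not a per-node thread count) -->
  <property>
    <name>mapred.map.tasks</name>
    <value>17</value>
  </property>
  <!-- default number of reduce tasks per job -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>7</value>
  </property>
</configuration>

If your file already looks roughly like that, the numbers themselves are
probably not what is breaking the generate step.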

Where did your crawl fail?

Jiaqi

P.S. I'm not a Nutch developer either, just a user.

On Fri, Feb 22, 2008 at 10:56 PM, Ivannie <[EMAIL PROTECTED]> wrote:
> hi, Jiaqi Tan & John Mendenhall
>
>  I have run into the same problem. I already tried correcting the log4j
>  bug and applying the change from
>
>  http://www.mail-archive.com/[EMAIL PROTECTED]/msg01991.html
>
>  and it still did not work. I am running on a cluster of 4 boxes with
>  Red Hat AS4.
>
>  I also checked hadoop.log and found nothing more informative, so I think
>  the problem is in the Generator. I saw someone say it might be caused by
>  bad settings for mapred.map.tasks and mapred.reduce.tasks. I have 4 PCs,
>  and following the explanation of mapred.map.tasks and mapred.reduce.tasks
>  I set them to 17 and 7. Is that right? Can someone help me?
>
>  thanks
>
>  ivannie
>
>
>  >> 08/02/20 15:38:09 WARN crawl.Generator: Generator: 0 records selected
>  >> for fetching, exiting ...
>  >> 08/02/20 15:38:09 INFO crawl.Crawl: Stopping at depth=0 - no more
>  >> URLs to fetch.
>  >> 08/02/20 15:38:09 WARN crawl.Crawl: No URLs to fetch - check your seed
>  >> list and URL filters.
>  >>
>  >> I've inserted code at Generator.java:424, which says:
>  >> if (readers == null || readers.length == 0
>  >>     || !readers[0].next(new FloatWritable())) {
>  >>   LOG.warn("Generator: 0 records selected for fetching, exiting ...");
>  >>
>  >> essentially at the decision point, to see which of the conditions
>  >> triggered the "0 records selected" message. The "readers" object is
>  >> perfectly fine, but the SequenceFileOutputFormat reports that there
>  >> are no values (of URL scores, I suppose) to be retrieved at all,
>  >> which causes the generator to stop.
>  >
>
>  > There is a problem with the Generator.  There was a change committed
>  > after 0.9 was released.  I implemented this change and it fixed my
>  > problem:
>  >
>  > http://www.mail-archive.com/[EMAIL PROTECTED]/msg01991.html
>  >
>
>  > JohnM
>  >
>  > --
>  > john mendenhall
>  > [EMAIL PROTECTED]
>  > surf utopia
>  > internet services
>
