Hi Makkara,

> but I believe that this is the fault of the reducer
> Map input records=22048
> Map output records=4
The items are skipped in the mapper.

> Is this a known problem of Nutch 2.4, or have I just misconfigured
> something?

It could be the configuration, or a bug in the storage layer causing not all
items of the web table to be sent to the mapper.

Please also note that we expect 2.4 to be the last release of the 2.x series.
We've decided to freeze development on the 2.x branch for now, as no committer
is actively working on it. Nutch 1.x is actively maintained.

Best,
Sebastian

On 11/13/19 3:42 PM, Makkara Mestari wrote:
> Hello
>
> I have injected about 1300 domains into the seed list.
>
> The first two fetches work nicely, but after that the crawler only selects
> URLs from a few domains, leaving all other URLs permanently in status 1
> (unfetched); these number in the tens of thousands. Currently the generator
> generates the same 4 URLs every time, all of them unreachable pages.
>
> I'm not sure, but I believe that this is the fault of the reducer. Here is a
> sample of the output during the generate phase with the setting -topN 50000:
>
> 2019-11-13 13:22:29,186 INFO mapreduce.Job - Job job_local1940214525_0001
> completed successfully
> 2019-11-13 13:22:29,210 INFO mapreduce.Job - Counters: 34
>     File System Counters
>         FILE: Number of bytes read=1313864
>         FILE: Number of bytes written=1904695
>         FILE: Number of read operations=0
>         FILE: Number of large read operations=0
>         FILE: Number of write operations=0
>     Map-Reduce Framework
>         Map input records=22048
>         Map output records=4
>         Map output bytes=584
>         Map output materialized bytes=599
>         Input split bytes=953
>         Combine input records=0
>         Combine output records=0
>         Reduce input groups=4
>         Reduce shuffle bytes=599
>         Reduce input records=4
>         Reduce output records=4
>         Spilled Records=8
>         Shuffled Maps =1
>         Failed Shuffles=0
>         Merged Map outputs=1
>         GC time elapsed (ms)=22
>         CPU time spent (ms)=0
>         Physical memory (bytes) snapshot=0
>         Virtual memory (bytes) snapshot=0
>         Total committed heap usage (bytes)=902823936
>     Generator
>         GENERATE_MARK=4
>     Shuffle Errors
>         BAD_ID=0
>         CONNECTION=0
>         IO_ERROR=0
>         WRONG_LENGTH=0
>         WRONG_MAP=0
>         WRONG_REDUCE=0
>     File Input Format Counters
>         Bytes Read=0
>     File Output Format Counters
>         Bytes Written=0
> 2019-11-13 13:22:29,238 INFO crawl.GeneratorJob - GeneratorJob: finished at
> 2019-11-13 13:22:29, time elapsed: 00:00:04
> 2019-11-13 13:22:29,238 INFO crawl.GeneratorJob - GeneratorJob: generated
> batch id: 1573651344-1856402192 containing 4 URLs
>
> If I reset the crawldb and inject only one of the domains, I can crawl it
> completely fine. This problem of never-fetched pages only arises when I work
> with a moderate number of domains at a time (1300 in this case).
>
> Is this a known problem of Nutch 2.4, or have I just misconfigured something?
>
> -Makkara
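[Editor's note] The counters quoted above are what localize the problem: the generator's mapper read 22048 rows from the web table but emitted only 4 candidates, so the filtering happens before the reducer ever runs. A small script can pull those numbers out of a job log to quantify the skip. This is a sketch by the editor, not part of the thread; it assumes only the `Name=value` counter format shown in the log above:

```python
import re

# Excerpt of the Hadoop counter output quoted in the thread above.
log = """\
Map input records=22048
Map output records=4
Reduce output records=4
"""

def counters(text):
    """Parse 'Name=value' counter lines into a dict of ints."""
    return {m.group(1).strip(): int(m.group(2))
            for m in re.finditer(r"^\s*([^=\n]+?)\s*=\s*(\d+)\s*$", text, re.M)}

c = counters(log)
skipped = c["Map input records"] - c["Map output records"]
print(f"rows read from web table:    {c['Map input records']}")
print(f"URLs selected for the batch: {c['Map output records']}")
print(f"rows skipped in the mapper:  {skipped}")  # 22044
```

A skip count this large (22044 of 22048 rows) is consistent with Sebastian's reading: the rows reach the mapper and are dropped there, rather than being lost in the shuffle or reduce stages.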

