Hi Makkara,

> but I believe that this is the fault of the reducer
>                 Map input records=22048
>                 Map output records=4

The items are skipped in the mapper.
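To see why "Map input records" can be 22048 while "Map output records" is only 4: the generate step filters rows on the mapper side before anything reaches the reducer. A minimal sketch of that selection logic, in Python for illustration (this is not the actual GeneratorMapper code, and the field names here are made-up simplifications):

```python
import time

def select_for_fetch(rows, cur_time=None):
    """Yield only the rows eligible for a new fetch (simplified sketch)."""
    if cur_time is None:
        cur_time = time.time()
    for row in rows:
        if row.get("generate_mark"):
            # already marked as part of a pending batch -> skipped
            continue
        if row.get("fetch_time", 0) > cur_time:
            # not yet due for (re-)fetching -> skipped
            continue
        yield row

rows = [
    {"url": "http://a.example/", "fetch_time": 0},
    {"url": "http://b.example/", "fetch_time": 9e12},  # due far in the future
    {"url": "http://c.example/", "fetch_time": 0,
     "generate_mark": "1573651344-1856402192"},        # already in a batch
]
print([r["url"] for r in select_for_fetch(rows)])  # ['http://a.example/']
```

If most of your 22048 rows are filtered out by conditions like these (stale generate marks, fetch times pushed into the future), the reducer never sees them at all.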

> Is this a known problem of Nutch 2.4, or have I just misconfigured
> something?

It could be the configuration, or a bug in the storage layer causing not all
items of the web table to be sent to the mapper.
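One quick way to read the counters in your log: compare the mapper's input and output record counts with the reducer's. A small illustrative check (hand-copied counter lines, not a real log parser):

```python
# Parse a few Hadoop counter lines and flag where records are dropped.
counters_text = """
Map input records=22048
Map output records=4
Reduce input records=4
Reduce output records=4
"""

counters = {}
for line in counters_text.strip().splitlines():
    name, _, value = line.partition("=")
    counters[name.strip()] = int(value)

dropped = counters["Map input records"] - counters["Map output records"]
print(f"mapper dropped {dropped} of {counters['Map input records']} records")
# The reducer kept every record it received, so it is not the culprit:
assert counters["Reduce input records"] == counters["Reduce output records"]
```

Here the reducer emits exactly what it receives (4 in, 4 out), while the mapper drops 22044 of 22048 records, so the filtering clearly happens on the map side.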

Please also note that we expect 2.4 to be the last release in the 2.x series.
We've decided to freeze development on the 2.x branch for now, as no committer
is actively working on it. Nutch 1.x is actively maintained.

Best,
Sebastian

On 11/13/19 3:42 PM, Makkara Mestari wrote:
> 
> 
> Hello
>  
> I have injected about 1300 domains into the seed list.
>  
> The first two fetches work nicely, but after that the crawler only selects 
> URLs from a few domains, leaving all the other URLs permanently in status 1 
> (unfetched), and those number in the tens of thousands. Currently the 
> generator generates the same 4 URLs every time, all of them unreachable pages.
>  
> I'm not sure, but I believe that this is the fault of the reducer. Here is a 
> sample of the output from the generate phase, run with -topN 50000:
>  
> 
> 2019-11-13 13:22:29,186 INFO  mapreduce.Job - Job job_local1940214525_0001 
> completed successfully
> 2019-11-13 13:22:29,210 INFO  mapreduce.Job - Counters: 34
>         File System Counters
>                 FILE: Number of bytes read=1313864
>                 FILE: Number of bytes written=1904695
>                 FILE: Number of read operations=0
>                 FILE: Number of large read operations=0
>                 FILE: Number of write operations=0
>         Map-Reduce Framework
>                 Map input records=22048
>                 Map output records=4
>                 Map output bytes=584
>                 Map output materialized bytes=599
>                 Input split bytes=953
>                 Combine input records=0
>                 Combine output records=0
>                 Reduce input groups=4
>                 Reduce shuffle bytes=599
>                 Reduce input records=4
>                 Reduce output records=4
>                 Spilled Records=8
>                 Shuffled Maps =1
>                 Failed Shuffles=0
>                 Merged Map outputs=1
>                 GC time elapsed (ms)=22
>                 CPU time spent (ms)=0
>                 Physical memory (bytes) snapshot=0
>                 Virtual memory (bytes) snapshot=0
>                 Total committed heap usage (bytes)=902823936
>         Generator
>                 GENERATE_MARK=4
>         Shuffle Errors
>                 BAD_ID=0
>                 CONNECTION=0
>                 IO_ERROR=0
>                 WRONG_LENGTH=0
>                 WRONG_MAP=0
>                 WRONG_REDUCE=0
>         File Input Format Counters
>                 Bytes Read=0
>         File Output Format Counters
>                 Bytes Written=0
> 2019-11-13 13:22:29,238 INFO  crawl.GeneratorJob - GeneratorJob: finished at 
> 2019-11-13 13:22:29, time elapsed: 00:00:04
> 2019-11-13 13:22:29,238 INFO  crawl.GeneratorJob - GeneratorJob: generated 
> batch id: 1573651344-1856402192 containing 4 URLs
>  
> If I reset the crawldb and inject only one of the domains, I can crawl it 
> completely fine; this problem of never-fetched pages only arises when I work 
> with a moderate number of domains at a time (1300 in this case).
>  
> Is this a known problem of Nutch 2.4, or have I just misconfigured something?
>  
> -Makkara
> 
