Hi folks, I am curious as to whether Nutch 2.x might solve some of the problems we are experiencing with Nutch 1.11 at very large scale (multiple billions of URLs). For now, the primary issue is the size of the crawldb and the time it takes to update, as well as the time it takes to index individual segments. I'm aware of the development on NUTCH-2184, enabling indexing without the crawldb, and if we stick with Nutch 1.x I'll rely heavily on that feature. We also compute LinkRank, which is very time-consuming, but I imagine that won't change much.
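For reference, here is roughly the 1.x cycle where those costs show up (a sketch only; the crawl/ paths and -topN value are placeholders, not our actual configuration):

```shell
# One iteration of the Nutch 1.x crawl cycle (placeholder paths/values)
bin/nutch inject crawl/crawldb urls/                        # seed URLs (run once)
bin/nutch generate crawl/crawldb crawl/segments -topN 50000 # create a new segment
SEGMENT=$(ls -d crawl/segments/* | tail -1)                 # most recent segment
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb crawl/crawldb "$SEGMENT"  # rewrites the entire crawldb -- our bottleneck
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
bin/nutch index crawl/crawldb -linkdb crawl/linkdb "$SEGMENT"
```

At billions of URLs, the updatedb and index steps dominate each iteration for us.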
1. Does Nutch 2.x's architecture alleviate some of these issues? I know, for example, that the updatedb step is intended to be much more efficient using Gora rather than reading and writing the entire crawldb as Hadoop data structures.

2. Do there exist for 2.x any diagrams similar to Sebastian's Nutch 1.x "Storage and Data Flow" diagram (http://image.slidesharecdn.com/aceu2014-snagel-web-crawling-nutch-141125144922-conversion-gate01/95/web-crawling-with-apache-nutch-16-638.jpg?cb=1416927690)? I found that diagram very helpful in understanding Nutch 1.x segments.

3. Is anyone aware of recent benchmarks comparing 1.x and 2.x, specifically 2.3? I'd love to give 2.x a spin and evaluate it myself, but it would be very costly to compare the two at the scale I'm referring to.

Thanks,
Joe