Hi folks,

I am curious whether Nutch 2.x might solve some of the problems we are
experiencing with Nutch 1.11 at a very large scale (multiple billions of
URLs). For now, the primary issues are the size of the crawldb and the time
it takes to update, as well as the time it takes to index individual
segments.
I'm aware of the work on NUTCH-2184, which enables indexing without the
crawldb; if we stick with Nutch 1.x I'll rely heavily on that feature.
We also compute LinkRank, which is very time-consuming, but I imagine that
won't change much.

1. Does Nutch 2.x's architecture alleviate some of these issues? I
understand, for example, that the updatedb step is intended to be much more
efficient: Gora updates individual rows in the backing store instead of
reading and rewriting the entire crawldb as Hadoop sequence files. (My
mental model is sketched after question 3 below; corrections welcome.)

2. Are there any diagrams for 2.x similar to Sebastian's Nutch 1.x
"Storage and Data Flow" diagram
(http://image.slidesharecdn.com/aceu2014-snagel-web-crawling-nutch-141125144922-conversion-gate01/95/web-crawling-with-apache-nutch-16-638.jpg?cb=1416927690)?
I found that diagram very helpful in understanding Nutch 1.x segments.

3. Is anyone aware of recent benchmarks comparing 1.x and 2.x, specifically
2.3?
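
For context on question 1, here is roughly how I picture a 2.x update
going through Gora's DataStore API, in contrast to the 1.x updatedb job
that rewrites the whole CrawlDb each cycle. The WebPage class, the
setFetchTime setter, and TableUtil.reverseUrl are my assumptions from
skimming the 2.3 source, so the exact names may be off; treat this as a
sketch, not working code:

    // Sketch of the row-level access pattern I believe Gora gives Nutch
    // 2.x, versus the 1.x updatedb job, which reads and rewrites the
    // entire CrawlDb as sequence files on every pass.
    import org.apache.gora.store.DataStore;
    import org.apache.gora.store.DataStoreFactory;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.storage.WebPage;
    import org.apache.nutch.util.TableUtil;

    public class RowLevelUpdateSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // The concrete store (HBase, Cassandra, ...) is picked via
            // gora.properties; the code only sees the DataStore API.
            DataStore<String, WebPage> store =
                DataStoreFactory.getDataStore(String.class, WebPage.class, conf);

            // Nutch 2.x keys rows by the URL with its host reversed.
            String key = TableUtil.reverseUrl("http://example.com/page");

            // Read one row, touch one field, write one row back.
            WebPage page = store.get(key);
            if (page != null) {
                page.setFetchTime(System.currentTimeMillis());
                store.put(key, page);
            }

            store.flush();
            store.close();
        }
    }

If that picture is right, update cost should scale with the URLs actually
fetched in a cycle rather than with the multi-billion-row total, which is
exactly what we need. Please correct me if I've misunderstood.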

I'd love to give 2.x a spin and evaluate it myself, but it would be very
costly to compare the two at the scale I'm referring to.

Thanks,
Joe
