Re: Nutch 2.x for large-scale crawls

Julien Nioche Mon, 20 Jun 2016 03:03:02 -0700

Hi Joseph,

I have been meaning to update the benchmarks for a while but haven't found
the time to do so. I will probably add StormCrawler to the mix next time.


One thing that helped with performance when I was running very large
crawls with Nutch 1.x was to generate multiple segments in one go, fetch
and parse them sequentially, then update the crawldb with the whole lot in
a single step. This saves you from running the costly generate and update
steps once per segment. The number of segments to generate is entirely up
to you, but even a modest value like 3 or 5 would have quite an impact on
the performance of the crawler. Do you
already do this with 1.x?
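
As a rough illustration of that cycle, here is a minimal sketch. The class
and method names are hypothetical and not part of Nutch; it only models the
order of the steps and counts how often the crawldb has to be read in full.
In Nutch 1.x this would presumably map to the generator's -maxNumSegments
option and to passing the segments directory to updatedb with -dir, but
check the usage output of your version.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: these names are hypothetical, not Nutch APIs.
class MultiSegmentCycle {

    // Ordered steps of one cycle that generates several segments at once.
    static List<String> planCycle(int segmentsPerCycle) {
        List<String> steps = new ArrayList<>();
        steps.add("generate " + segmentsPerCycle + " segments");
        for (int i = 1; i <= segmentsPerCycle; i++) {
            steps.add("fetch segment " + i);
            steps.add("parse segment " + i);
        }
        steps.add("updatedb with all " + segmentsPerCycle + " segments");
        return steps;
    }

    // Full crawldb passes (one generate plus one updatedb per cycle)
    // needed to work through totalSegments segments.
    static int crawlDbPasses(int totalSegments, int segmentsPerCycle) {
        // Ceiling division: a partially filled last cycle still costs two passes.
        int cycles = (totalSegments + segmentsPerCycle - 1) / segmentsPerCycle;
        return 2 * cycles;
    }

    public static void main(String[] args) {
        planCycle(3).forEach(System.out::println);
        System.out.println("passes, 1 segment per cycle:  " + crawlDbPasses(15, 1));
        System.out.println("passes, 5 segments per cycle: " + crawlDbPasses(15, 5));
    }
}
```

Under this simple model, 15 segments cost 30 crawldb passes when handled one
at a time and 6 when generated in batches of 5. The real saving depends on
how large the crawldb is relative to the fetch and parse time.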

Julien

On 17 June 2016 at 21:40, Sebastian Nagel <wastl.na...@googlemail.com>
wrote:

> Hi,
>
> > 1. Does Nutch 2.x's architecture alleviate some of this issue?
>
> That is/was the objective of Nutch 2.x, inspired by the Bigtable [1]
> and Percolator [2] papers.
>
> > 2. Does there exist for 2.x any diagrams similar to Sebastian's Nutch 1.x
> > "Storage and Data Flow" diagram
>
> Don't know. But it's pretty simple: everything is stored in one table.
> Rows are pages, and from the code it is clear which fields/columns are
> accessed (read or written) by each step/tool, e.g. by the ParserJob:
>   static {
>     FIELDS.add(WebPage.Field.STATUS);
>     FIELDS.add(WebPage.Field.CONTENT);
>     FIELDS.add(WebPage.Field.CONTENT_TYPE);
>     FIELDS.add(WebPage.Field.SIGNATURE);
>     FIELDS.add(WebPage.Field.MARKERS);
>     FIELDS.add(WebPage.Field.PARSE_STATUS);
>     FIELDS.add(WebPage.Field.OUTLINKS);
>     FIELDS.add(WebPage.Field.METADATA);
>     FIELDS.add(WebPage.Field.HEADERS);
>     FIELDS.add(WebPage.Field.SITEMAPS);
>     FIELDS.add(WebPage.Field.STM_PRIORITY);
>   }
> That's a notable simplification compared to 1.x where it is really hard
> to understand the data flow.
>
> > 3. Is anyone aware of recent benchmarks comparing 1.x and 2.x,
> > specifically 2.3?
>
> I guess you know about Julien's "Nutch fight! 1.7 vs 2.2.1" [3]
> Afaik, there's no recent update.
>
> Sebastian
>
> [1] Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson C.; Wallach,
> Deborah A.; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; Gruber,
> Robert E., 2006: Bigtable: A distributed storage system for structured
> data. In: Proceedings of the 7th Conference on USENIX Symposium on
> Operating Systems Design and Implementation (OSDI ’06), vol. 7, pp.
> 205–218, http://www.usenix.org/events/osdi06/tech/chang/chang.pdf
>
> [2] Peng, Daniel; Dabek, Frank, 2010: Large-scale incremental processing
> using distributed transactions and notifications. In: 9th USENIX
> Symposium on Operating Systems Design and Implementation, pp. 4–6,
> http://www.usenix.org/event/osdi10/tech/full_papers/Peng.pdf
>
> [3] http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
>
>
> On 06/17/2016 03:00 PM, Joseph Naegele wrote:
> > Hi folks,
> >
> > I am curious as to whether Nutch 2.x might solve some of the problems
> > we are experiencing with Nutch 1.11 at a very large scale (multiple
> > billions of URLs). For now, the primary issue is the size of the
> > crawldb and the time it takes to update, as well as the time it takes
> > to index individual segments. I'm aware of the development on
> > NUTCH-2184, enabling indexing without the crawldb, and if we stick
> > with Nutch 1.x I'll rely heavily on that feature. We also compute
> > LinkRank, which is very time-consuming, but I imagine that won't
> > change much.
> >
> > 1. Does Nutch 2.x's architecture alleviate some of this issue? I know,
> > for example, the updatedb step is intended to be much more efficient
> > using Gora rather than reading/writing the entire crawldb using Hadoop
> > data structures.
> > 2. Does there exist for 2.x any diagrams similar to Sebastian's Nutch 1.x
> > "Storage and Data Flow" diagram
> > (http://image.slidesharecdn.com/aceu2014-snagel-web-crawling-nutch-141125144922-conversion-gate01/95/web-crawling-with-apache-nutch-16-638.jpg?cb=1416927690)?
> > I found that diagram very helpful in understanding Nutch 1.x segments.
> >
> > 3. Is anyone aware of recent benchmarks comparing 1.x and 2.x,
> > specifically 2.3?
> >
> > I'd love to give 2.x a spin and evaluate it myself, but it would be very
> > costly to compare the two at the scale I'm referring to.
> >
> > Thanks,
> > Joe
> >
>
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>
