Re: The Future of Nutch, reactivated

Aaron Binns Mon, 18 May 2009 12:02:29 -0700

Andrzej Bialecki <a...@getopt.org> writes:

> Target audience
> ===============
> I think that the Nutch project experiences a crisis of personality now -
> we are not sure what is the target audience, and we cannot satisfy
> everyone. I think that there are following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: actually, there are only few
> such users, because it takes considerable resources to manage operations
> on that scale. Scalability, manage-ability and ranking/spam prevention
> are the chief concerns here.


We here at the Internet Archive are one of these users.  Our numbers
are small, although our data is large: we routinely deal with
collections of more than 500 million documents (primarily web pages).

We have developed a set of add-ons and modifications to Nutch called
NutchWAX (Web Archive eXtensions).  We use NutchWAX both for our
internal projects (such as archive-it.org) as well as with our national
library partners.

In the coming years, more and more national libraries will be building
their own web archives, mainly by performing "domain harvests" of
websites in a country's domain.  So, I expect the number of users
operating at this scale to grow to a few dozen in the next few
years.

Our usage of Nutch is focused on index building and search services.  We
don't use the crawling/fetching features at all; we crawl with Heritrix.
Typically, our large-scale harvests are performed over 8-12 week
periods, then the archived data is handed off to me for full-text search
indexing.  We deploy the indexes on a separate rack of machines
dedicated to hosting the full-text search service.

One of the biggest boons of Nutch is the Hadoop infrastructure.  When
indexing massive data sets, being able to fire up 60+ nodes in a Hadoop
system helps tremendously.

However, one of the biggest challenges to using Nutch is the fact
that the URL is used as the unique key for a document.  This is usually
a sensible thing to do, but for web archives, it doesn't work.  Our
NutchWAX package contains all sorts of hacks to work around this
assumption.
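
To illustrate the problem, here is a toy sketch in Python.  It is not
NutchWAX code: the Capture record, both function names, and the idea of
keying on (URL, capture timestamp) are made up for this example, and it
assumes the archive holds several captures of one URL taken at
different times.

```python
from collections import namedtuple

# Hypothetical record for one archived capture of a page.
Capture = namedtuple("Capture", ["url", "timestamp", "body"])


def index_by_url(captures):
    """Key on the URL alone: a later capture replaces an earlier one."""
    index = {}
    for capture in captures:
        index[capture.url] = capture
    return index


def index_by_url_and_time(captures):
    """Key on (URL, timestamp): every capture keeps its own entry."""
    index = {}
    for capture in captures:
        index[(capture.url, capture.timestamp)] = capture
    return index


captures = [
    Capture("http://example.org/", "20080101000000", "old front page"),
    Capture("http://example.org/", "20090101000000", "new front page"),
]

print(len(index_by_url(captures)))           # entries when keyed by URL
print(len(index_by_url_and_time(captures)))  # entries with composite key
```

With the URL as the only key, the two captures collapse into a single
entry; with a composite key, both versions stay searchable.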


As for the future of Nutch, I am concerned over what I see to be an
increasing focus on crawling and fetching.  We have only lightly
evaluated other Open Source search projects, such as Solr, and are not
convinced any can be a drop-in replacement for Nutch.  Solr certainly
has some nice features, but I'm not convinced it can scale up to the
billion-document level.


Aaron

-- 
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aa...@archive.org
