Andrzej Bialecki <a...@getopt.org> writes:

> Target audience
> ===============
> I think that the Nutch project experiences a crisis of personality now -
> we are not sure what is the target audience, and we cannot satisfy
> everyone. I think that there are following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: actually, there are only few
> such users, because it takes considerable resources to manage operations
> on that scale. Scalability, manage-ability and ranking/spam prevention
> are the chief concerns here.
We here at the Internet Archive are one of these users. Our numbers are small, although our data is big. We routinely deal with collections of more than 500 million documents (primarily web pages).

We have developed a set of add-ons and modifications to Nutch called NutchWAX (Web Archive eXtensions). We use NutchWAX both for our internal projects (such as archive-it.org) and with our national library partners.

In the coming years, more and more national libraries will be building their own web archives, mainly by performing "domain harvests" of websites in a country's domain. So I expect the number of users operating at this scale to grow to a few dozen in the next few years.

Our usage of Nutch is focused on index building and search services. We don't use the crawling/fetching features at all; we use Heritrix for that. Typically, our large-scale harvests are performed over 8-12 week periods, then the archived data is handed off to me for full-text search indexing. We deploy the indexes on a separate rack of machines dedicated to hosting the full-text search service.

One of the biggest boons of Nutch is the Hadoop infrastructure. When indexing massive data sets, being able to fire up 60+ nodes in a Hadoop system helps tremendously.

However, one of the biggest challenges to using Nutch is the fact that the URL is used as the unique key for a document. This is usually a sensible thing to do, but for web archives it doesn't work, since an archive can hold many captures of the same URL taken at different times. Our NutchWAX package contains all sorts of hacks to work around this assumption.

As for the future of Nutch, I am concerned over what I see to be an increasing focus on crawling and fetching.

We have only lightly evaluated other Open Source search projects, such as Solr, and are not convinced any can be a drop-in replacement for Nutch. Solr certainly has some nice features; I'm just not convinced it can scale up to the billion-document level.
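To make the URL-as-key problem mentioned above concrete, here is a minimal sketch in Python. It is not Nutch or NutchWAX code: the Capture record, its fields, the two key functions, and the sample URL, timestamps and digests are all hypothetical, chosen only for illustration. It assumes an archive holds two captures of the same URL and shows that keying by URL alone keeps only one of them, while a composite key of URL plus capture time keeps both.

```python
from collections import namedtuple

# Hypothetical record for one archived capture; field names are illustrative only.
Capture = namedtuple("Capture", ["url", "timestamp", "digest"])


def url_key(capture):
    """Key by URL alone, as an index of the live web reasonably would."""
    return capture.url


def archive_key(capture):
    """Key by URL plus capture time, so every capture stays distinct."""
    return (capture.url, capture.timestamp)


def build_index(captures, key_fn):
    """Build a dict keyed by key_fn; a later capture with the same key overwrites an earlier one."""
    index = {}
    for capture in captures:
        index[key_fn(capture)] = capture
    return index


# Two made-up captures of the same URL, taken a year apart.
captures = [
    Capture("http://example.org/", "20060101000000", "sha1:AAA"),
    Capture("http://example.org/", "20070101000000", "sha1:BBB"),
]

by_url = build_index(captures, url_key)
by_capture = build_index(captures, archive_key)

print(len(by_url))      # 1: the 2006 capture was overwritten
print(len(by_capture))  # 2: both captures survive
```

A composite key is only one possible way around the assumption; the point of the sketch is just that any system treating the URL as unique will silently collapse repeated captures.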
Aaron

--
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aa...@archive.org