----- Original Message -----
From: Aaron Binns <aa...@archive.org>
To: nutch-dev@lucene.apache.org <nutch-dev@lucene.apache.org>
Sent: Tue May 19 13:23:37 2009
Subject: Re: The Future of Nutch, reactivated
Andrzej Bialecki <a...@getopt.org> writes:

>> One of the biggest boons of Nutch is the Hadoop infrastructure. When
>> indexing massive data sets, being able to fire up 60+ nodes in a
>> Hadoop system helps tremendously.
>
> Are you familiar with the distributed indexing package in Hadoop
> contrib/ ?

Only superficially at most. Last I looked at it, it seemed to be a "hello world" prototype. If it's been developed further, it might be worth another look.

>> However, one of the biggest challenges to using Nutch is the fact
>> that the URL is used as the unique key for a document.
>
> Indeed, this change is something that I've been considering, too --
> URL==page doesn't work that well in the case of archives, but also when
> your unit of information is smaller (pagelet) or larger (compound
> docs) than a page.
>
> People can help with this by working on a patch that replaces this
> silent assumption with an explicit API, i.e. splitting recordId and
> URL into separate fields.

Patches are always welcome; it is an open source project, after all :) I'll see about creating a patch-set for the changes I've made in NutchWAX.

>> As for the future of Nutch, I am concerned over what I see to be an
>> increasing focus on crawling and fetching. We have only lightly
>> evaluated other open source search projects, such as Solr, and are not
>> convinced any can be a drop-in replacement for Nutch. It looks like
>> Solr has some nice features for certain; I'm just not convinced it can
>> scale up to the billion-document level.
>
> What do you see as the unique strength of Nutch, then? IMHO there are
> existing frameworks for distributed indexing (on Hadoop) and
> distributed search (e.g. Katta). We would like to avoid the
> duplication of effort, and to focus instead on the aspects of Nutch
> functionality that are not available elsewhere.
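To illustrate the recordId/URL split Andrzej describes, here is a minimal sketch of what such an explicit key might look like. The class and field names (DocumentKey, recordId) are invented for illustration and are not part of the Nutch or NutchWAX API; the point is only that two archived captures of the same URL get distinct identities.

```java
// Hypothetical sketch: a document key where identity comes from a
// per-capture recordId, not from the URL. In a web archive the same
// URL may be captured many times, so URL alone cannot be the key.
public final class DocumentKey {
    private final String recordId; // unique per capture, e.g. "<timestamp>/<url>"
    private final String url;      // the page URL, shared across captures

    public DocumentKey(String recordId, String url) {
        this.recordId = recordId;
        this.url = url;
    }

    public String getRecordId() { return recordId; }
    public String getUrl() { return url; }

    @Override
    public boolean equals(Object o) {
        // Equality is based on recordId only, so two captures of the
        // same URL are distinct documents.
        return o instanceof DocumentKey
            && ((DocumentKey) o).recordId.equals(recordId);
    }

    @Override
    public int hashCode() { return recordId.hashCode(); }
}
```

With this shape, replacing the implicit URL==key assumption becomes a matter of threading the key type through the indexing and search APIs rather than changing the meaning of the URL field everywhere.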
Right now, the unique strength of Nutch -- to my organization -- is that it has all the requisite pieces and comes closer to a complete solution than other open source projects. The features it lacks compared to others are less important than the ones it has that others do not.

Two key features of Nutch indexing are the content parsing and the link extraction. The parsing plugins seem to work well enough, although easier modification of content tokenizing and stop-list management would be nice. For example, using a config file to tweak the tokenizing for, say, French or Spanish would be nicer than having to write a new .jj file and do a custom build. Along the same lines, language awareness would have to be included in the query processing as well.

Speaking of which, the way in which Nutch query processing is optimized for web search makes sense. I've read that Solr can be configured to emulate Nutch's query processing; if so, that would eliminate a competitive advantage of Nutch. Nutch's summary/snippet generation approach works fine. It's not clear to me how this is done with the other tools.

On the search service side of things, Nutch is adequate, but I would like to investigate other distributed search systems. My main complaint about Nutch's implementation is the use of the Hadoop RPC mechanism: it's very difficult to diagnose and debug problems. I'd prefer it if the master just talked to the slaves over OpenSearch or a simple HTTP/JSON interface. That way, monitoring tools could easily ping the slaves and check for sensible results.

Along the same diagnosis/debugging lines, I've added more log messages to the start-up code of the search slave. Without these, it's very difficult to diagnose some trivial mistake in the deployment of the index/segment shards, such as a mis-named directory or the like.

Lastly, there's also the fact that Nutch is a known quantity, and we've already put non-trivial effort into using and adapting it to our needs.
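The kind of pingable HTTP/JSON status endpoint suggested above might look something like the following sketch, using only the JDK's built-in com.sun.net.httpserver. The /status path, port, and JSON fields are invented for illustration; this is not how Nutch's Hadoop-RPC-based search servers actually work, just the sort of interface a monitoring tool could hit directly.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: a search slave exposing its health over plain
// HTTP/JSON so a monitoring tool can curl it and sanity-check the reply.
public class SlaveStatusServer {

    // Build the status reply. Reporting whether the index/segment shards
    // were found at start-up surfaces deployment mistakes (e.g. a
    // mis-named directory) without digging through RPC internals.
    static String statusJson(boolean shardsOk, int segmentCount) {
        return "{\"status\":\"" + (shardsOk ? "ok" : "error")
             + "\",\"segments\":" + segmentCount + "}";
    }

    static void handleStatus(HttpExchange exchange) throws IOException {
        byte[] body = statusJson(true, 3).getBytes(StandardCharsets.UTF_8);
        exchange.getResponseHeaders().set("Content-Type", "application/json");
        exchange.sendResponseHeaders(200, body.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(body);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/status", SlaveStatusServer::handleStatus);
        server.start(); // now: curl http://localhost:8080/status
    }
}
```

The appeal of this style is exactly the debuggability argued for above: a plain-text protocol can be exercised with curl or a browser, whereas a custom RPC wire format cannot.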
It would be difficult to start all over again with another toolset, or assemblage of tools. We also have scaling expectations based on what we've achieved so far with Nutch(WAX). It would be painful to invest the time and effort in, say, Solr, only to discover it can't scale to the same size on the same hardware.

Right now, the most interesting other project for us to consider is Solr. There seems to be more and more momentum behind it, and it does have some neat features, such as the "did you mean?" suggestions. However, the distributed search functionality is pretty rudimentary IMO, and I am concerned about reports that it doesn't scale beyond a few million or tens of millions of documents. It appears that some of this has to do with the modify/update capabilities, which are mitigated by the use of read-only IndexReaders (or something like that).

Aaron

-- 
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aa...@archive.org