I would like to point out that Nutch is going to be essential to our company's infrastructure -- we're definitely case #1. We'll probably have it running on 100 boxes in a few weeks.
On Tue, May 19, 2009 at 2:26 PM, Mark Olson <mark.ol...@quantum.com> wrote:

> R
>
> ----- Original Message -----
> From: Aaron Binns <aa...@archive.org>
> To: nutch-dev@lucene.apache.org <nutch-dev@lucene.apache.org>
> Sent: Tue May 19 13:23:37 2009
> Subject: Re: The Future of Nutch, reactivated
>
>
> Andrzej Bialecki <a...@getopt.org> writes:
>
> >> One of the biggest boons of Nutch is the Hadoop infrastructure. When
> >> indexing massive data sets, being able to fire up 60+ nodes in a
> >> Hadoop system helps tremendously.
> >
> > Are you familiar with the distributed indexing package in Hadoop
> > contrib/ ?
>
> Only superficially at most. Last I looked at it, it seemed to be a
> "hello world" prototype. If it has been developed further, it might be
> worth another look.
>
> >> However, one of the biggest challenges to using Nutch is the fact
> >> that the URL is used as the unique key for a document.
> >
> > Indeed, this change is something that I've been considering, too --
> > URL == page doesn't work that well in the case of archives, but also
> > when your unit of information is smaller (a pagelet) or larger (a
> > compound doc) than a page.
> >
> > People can help with this by working on a patch that replaces this
> > silent assumption with an explicit API, i.e. splitting recordId and
> > URL into separate fields.
>
> Patches are always welcome; it is an open source package, after all :)
> I'll see about creating a patch set for the changes I've made in
> NutchWAX.
>
> >> As for the future of Nutch, I am concerned over what I see to be an
> >> increasing focus on crawling and fetching. We have only lightly
> >> evaluated other open source search projects, such as Solr, and are
> >> not convinced any can be a drop-in replacement for Nutch. It looks
> >> like Solr has some nice features, certainly; I'm just not convinced
> >> it can scale up to the billion-document level.
> >
> > What do you see as the unique strength of Nutch, then?
> > IMHO there are existing frameworks for distributed indexing (on
> > Hadoop) and distributed search (e.g. Katta). We would like to avoid
> > duplication of effort, and to focus instead on the aspects of Nutch
> > functionality that are not available elsewhere.
>
> Right now, the unique strength of Nutch -- to my organization -- is
> that it has all the requisite pieces and comes closer to a complete
> solution than other open source projects. The features it lacks
> compared to others are less important than the ones it has that others
> do not.
>
> Two key features of Nutch indexing are the content parsing and the
> link extraction. The parsing plugins seem to work well enough,
> although easier modification of content tokenizing and stop-list
> management would be nice. For example, using a config file to tweak
> the tokenizing for, say, French or Spanish would be nicer than having
> to write a new .jj file and do a custom build.
>
> Along the same lines, language awareness would have to be included in
> the query processing as well. And speaking of which, the way in which
> Nutch query processing is optimized for web search makes sense. I've
> read that Solr can be configured to emulate Nutch's query processing;
> if so, that would eliminate a competitive advantage of Nutch.
>
> Nutch's summary/snippet generation approach works fine. It's not
> clear to me how this is done with the other tools.
>
> On the search service side of things, Nutch is adequate, but I would
> like to investigate other distributed search systems. My main
> complaint about Nutch's implementation is the use of the Hadoop RPC
> mechanism; it is very difficult to diagnose and debug problems. I'd
> prefer it if the master just talked to the slaves over OpenSearch or a
> simple HTTP/JSON interface. That way, monitoring tools could easily
> ping the slaves and check for sensible results.
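The HTTP/JSON monitoring idea above can be sketched roughly as follows. Note that everything here is hypothetical: Nutch's search slaves do not actually expose such an endpoint, and the status URL, field names (`status`, `shard`, `numDocs`), and health criteria are illustrative assumptions only.

```python
# Hypothetical health check for a search slave that exposes a simple
# HTTP/JSON status endpoint. A monitoring tool would fetch the response
# body from e.g. http://slave:8080/status and feed it to this function.
import json

def slave_is_healthy(raw_response: str) -> bool:
    """Return True if the slave's JSON status response looks sane."""
    try:
        status = json.loads(raw_response)
    except ValueError:
        return False  # unparseable response: treat the slave as down
    # A healthy slave should report "ok", name the shard it serves,
    # and give a non-negative document count.
    return (
        status.get("status") == "ok"
        and isinstance(status.get("numDocs"), int)
        and status["numDocs"] >= 0
        and bool(status.get("shard"))
    )

sample = '{"status": "ok", "shard": "segment-00042", "numDocs": 1583042}'
print(slave_is_healthy(sample))       # True
print(slave_is_healthy("not json"))   # False
```

The point of the plain-HTTP approach, as the message argues, is exactly this kind of trivial external checking: any off-the-shelf monitor can issue the request and validate the response, with no Hadoop RPC client code involved.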
> Along the same diagnosis/debug lines, I've added more log messages to
> the start-up code of the search slave. Without these, it's very
> difficult to diagnose some trivial mistake in the deployment of the
> index/segment shards, such as a mis-named directory or the like.
>
> Lastly, there's also the fact that Nutch is a known quantity and
> we've already put non-trivial effort into using and adapting it to
> our needs. It would be difficult to start all over again with another
> toolset, or assemblage of tools. We also have scaling expectations
> based on what we've achieved so far with Nutch(WAX). It would be
> painful to invest the time and effort in, say, Solr only to discover
> it can't scale to the same size with the same hardware.
>
> Right now, the most interesting other project for us to consider is
> Solr. There seems to be more and more momentum behind it, and it does
> have some neat features, such as the "did you mean?" suggestions.
> However, the distributed search functionality is pretty rudimentary
> IMO, and I am concerned about reports that it doesn't scale beyond a
> few million or tens of millions of documents. It appears that some of
> this has to do with the modify/update capabilities, though, which are
> mitigated by the use of read-only IndexReaders (or something like
> that).
>
>
> Aaron
>
> --
> Aaron Binns
> Senior Software Engineer, Web Group
> Internet Archive
> aa...@archive.org