Re: The Future of Nutch, reactivated
Hello,

(I saw the first copy of this email went to nutch-user, but I assume nutch-dev was a resend and the right list to follow up on.)

I agree with the list of core competencies. For example -- and I don't know where I said or wrote this, but I know I've said it a few times before -- I think Solr is the future of Nutch's search. I have a feeling the original Nutch search components will die off with time: nobody is working on them, and Solr is making great progress. (A small code sketch of that direction appears after the quoted message below.)

In my experience, most Nutch users fall under #2. Most require web-wide crawling, but really care about a specific vertical slice. So that's where I'd say the focus should be, theoretically. I say theoretically because I don't think active Nutch developers can really choose a direction if it doesn't match their own itches.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
> From: Andrzej Bialecki
> To: nutch-dev@lucene.apache.org
> Sent: Thursday, May 14, 2009 9:59:11 AM
> Subject: The Future of Nutch, reactivated
>
> Hi all,
>
> I'd like to revive this thread and gather additional feedback so that we end up with concrete conclusions. Much of what I write below others have said before; I'm trying to express it here as it looks from my point of view.
>
> Target audience
> ===
> I think that the Nutch project is experiencing an identity crisis now -- we are not sure what the target audience is, and we cannot satisfy everyone. I think there are the following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: actually, there are only a few such users, because it takes considerable resources to manage operations on that scale. Scalability, manageability and ranking/spam prevention are the chief concerns here.
>
> 2. Medium-scale vertical search: I suspect that many Nutch users fall into this category. Modularity, flexibility in implementing custom processing, and the ability to modify workflows and to use only some Nutch components seem to be the chief concerns here. Scalability too, but only up to a volume of ~100-200 mln documents.
>
> 3. Small- to medium-scale enterprise search: there's a sizeable number of Nutch users that fall into this category, for historical reasons. Link-based ranking and resource discovery are not that important here, but integration with Windows networking, Microsoft formats and databases, as well as realtime indexing and easy index maintenance, are crucial. This class of users often has to heavily customize Nutch to get any sensible result. Also, this is where Solr really shines, so there is little benefit in using Nutch here. I predict that Nutch will have fewer and fewer users of this type.
>
> 4. Single desktop to small intranet search: as above, but the emphasis is on ease of use out of the box, and an often requested feature is a GUI frontend. Currently, IMHO, Nutch is too complex and requires too much command-line operation for casual users to make this use case attractive.
>
> What is the target audience that we as a community want to support? By this I mean not only moral support, but also active participation in the development process. From where we are at the moment we could go in any of the above directions.
>
> Core competence
> ===
> This is a simple but important point. Currently we maintain several major subsystems in Nutch that are implemented by other projects, and often in a better way.
> The plugin framework (and dependency injection) and content parsing are two areas that we have to delegate to third-party libraries, such as Tika and OSGi or some other simple IoC container -- probably there are other components that we don't have to do ourselves. Another thing that I'd love to delegate is distributed search and index maintenance -- either through Solr or Katta or something else.
>
> The question then is, what is the core competence of this project? I see the following major areas that are unique to Nutch:
>
> * crawling - this includes crawl scheduling (and re-crawl scheduling), discovery and classification of new resources, strategies for crawling specific sets of URLs (hosts and domains) under bandwidth and netiquette constraints, etc.
>
> * web graph analysis - this includes link-based ranking, mirror detection (and URL "aliasing"), but also link spam detection and more sophisticated control over the crawl frontier.
>
> Anything more? I'm not sure - perhaps I would add template detection and pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).
>
> Nutch 1.0 already made some steps in this direction, with the new link analysis package and pluggable FetchSchedule and Signature.
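To make Otis's point about Solr concrete: pushing Nutch-parsed documents into Solr is a small amount of client code. A minimal sketch, assuming the SolrJ client of that era (CommonsHttpSolrServer), a local default Solr core, and illustrative field names -- this is not Nutch's actual Solr indexing code:

```java
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrIndexSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative Solr endpoint; in practice this would be configurable.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // A document roughly as it might come out of Nutch's parse step.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/page.html"); // URL as document id
        doc.addField("title", "Example page");
        doc.addField("content", "Parsed plain text of the page ...");
        doc.addField("boost", 1.0f);                        // link-analysis score, if any

        solr.add(doc);   // send to Solr
        solr.commit();   // make it searchable
    }
}
```

Searching would then go through Solr's own HTTP API rather than the Nutch search servers.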
Re: The Future of Nutch, reactivated
I would like to point out that Nutch is going to be essential to our company's infrastructure -- we're definitely case #1. We'll probably have it running on 100 boxes in a few weeks.

On Tue, May 19, 2009 at 2:26 PM, Mark Olson wrote:
> R
>
> ----- Original Message -----
> From: Aaron Binns
> To: nutch-dev@lucene.apache.org
> Sent: Tue May 19 13:23:37 2009
> Subject: Re: The Future of Nutch, reactivated
>
> Andrzej Bialecki writes:
>
> >> One of the biggest boons of Nutch is the Hadoop infrastructure. When indexing massive data sets, being able to fire up 60+ nodes in a Hadoop system helps tremendously.
> >
> > Are you familiar with the distributed indexing package in Hadoop contrib/?
>
> Only superficially at most. Last I looked at it, it seemed to be a "hello world" prototype. If it's developed more, it might be worth another look.
>
> >> However, one of the biggest challenges to using Nutch is the fact that the URL is used as the unique key for a document.
> >
> > Indeed, this change is something that I've been considering, too -- URL==page doesn't work that well in the case of archives, but also when your unit of information is smaller (pagelet) or larger (compound docs) than a page.
> >
> > People can help with this by working on a patch that replaces this silent assumption with an explicit API, i.e. splitting recordId and URL into separate fields.
>
> Patches are always welcome; it is an open source package, after all :) I'll see about creating a patch-set for the changes I've made in NutchWAX.
>
> >> As for the future of Nutch, I am concerned over what I see to be an increasing focus on crawling and fetching. We have only lightly evaluated other open source search projects, such as Solr, and are not convinced any can be a drop-in replacement for Nutch. It looks like Solr has some nice features for certain; I'm just not convinced it can scale up to the billion-document level.
> >
> > What do you see as the unique strength of Nutch, then? IMHO there are existing frameworks for distributed indexing (on Hadoop) and distributed search (e.g. Katta). We would like to avoid duplication of effort, and to focus instead on the aspects of Nutch functionality that are not available elsewhere.
>
> Right now, the unique strength of Nutch -- to my organization -- is that it has all the requisite pieces and comes closer to a complete solution than other open source projects. What features it lacks compared to others are less important than the ones it has that others do not.
>
> Two key features of Nutch indexing are the content parsing and the link extraction. The parsing plugins seem to work well enough, although easier modification of content tokenizing and stop-list management would be nice. For example, using a config file to tweak the tokenizing for, say, French or Spanish would be nicer than having to write a new .jj file and do a custom build.
>
> Along the same lines, language awareness would have to be included in the query processing as well. And speaking of which, the way in which Nutch query processing is optimized for web search makes sense. I've read that Solr can be configured to emulate the Nutch query processing. If so, that would eliminate a competitive advantage of Nutch.
>
> Nutch's summary/snippet generation approach works fine. It's not clear to me how this is done with the other tools.
>
> On the search service side of things, Nutch is adequate, but I would like to investigate other distributed search systems. My main complaint about Nutch's implementation is the use of the Hadoop RPC mechanism. It's very difficult to diagnose and debug problems. I'd prefer it if the master just talked to the slaves over OpenSearch or a simple HTTP/JSON interface. This way, monitoring tools could easily ping the slaves and check for sensible results.
>
> Along the same diagnosis/debugging lines, I've added more log messages to the start-up code of the search slave. Without these, it's very difficult to diagnose some trivial mistake in the deployment of the index/segment shards, such as a misnamed directory or the like.
>
> Lastly, there's also the fact that Nutch is a known quantity, and we've already put non-trivial effort into using and adapting it to our needs. It would be difficult to start all over again with another toolset, or assemblage of tools. We also have scaling expectations based on what we've achieved so far with Nutch(WAX).
Re: The Future of Nutch, reactivated
Andrzej Bialecki writes:

>> One of the biggest boons of Nutch is the Hadoop infrastructure. When indexing massive data sets, being able to fire up 60+ nodes in a Hadoop system helps tremendously.
>
> Are you familiar with the distributed indexing package in Hadoop contrib/?

Only superficially at most. Last I looked at it, it seemed to be a "hello world" prototype. If it's developed more, it might be worth another look.

>> However, one of the biggest challenges to using Nutch is the fact that the URL is used as the unique key for a document.
>
> Indeed, this change is something that I've been considering, too -- URL==page doesn't work that well in the case of archives, but also when your unit of information is smaller (pagelet) or larger (compound docs) than a page.
>
> People can help with this by working on a patch that replaces this silent assumption with an explicit API, i.e. splitting recordId and URL into separate fields.

Patches are always welcome; it is an open source package, after all :) I'll see about creating a patch-set for the changes I've made in NutchWAX.

>> As for the future of Nutch, I am concerned over what I see to be an increasing focus on crawling and fetching. We have only lightly evaluated other open source search projects, such as Solr, and are not convinced any can be a drop-in replacement for Nutch. It looks like Solr has some nice features for certain; I'm just not convinced it can scale up to the billion-document level.
>
> What do you see as the unique strength of Nutch, then? IMHO there are existing frameworks for distributed indexing (on Hadoop) and distributed search (e.g. Katta). We would like to avoid duplication of effort, and to focus instead on the aspects of Nutch functionality that are not available elsewhere.

Right now, the unique strength of Nutch -- to my organization -- is that it has all the requisite pieces and comes closer to a complete solution than other open source projects. What features it lacks compared to others are less important than the ones it has that others do not.

Two key features of Nutch indexing are the content parsing and the link extraction. The parsing plugins seem to work well enough, although easier modification of content tokenizing and stop-list management would be nice. For example, using a config file to tweak the tokenizing for, say, French or Spanish would be nicer than having to write a new .jj file and do a custom build.

Along the same lines, language awareness would have to be included in the query processing as well. And speaking of which, the way in which Nutch query processing is optimized for web search makes sense. I've read that Solr can be configured to emulate the Nutch query processing. If so, that would eliminate a competitive advantage of Nutch.

Nutch's summary/snippet generation approach works fine. It's not clear to me how this is done with the other tools.

On the search service side of things, Nutch is adequate, but I would like to investigate other distributed search systems. My main complaint about Nutch's implementation is the use of the Hadoop RPC mechanism. It's very difficult to diagnose and debug problems. I'd prefer it if the master just talked to the slaves over OpenSearch or a simple HTTP/JSON interface. This way, monitoring tools could easily ping the slaves and check for sensible results.

Along the same diagnosis/debugging lines, I've added more log messages to the start-up code of the search slave. Without these, it's very difficult to diagnose some trivial mistake in the deployment of the index/segment shards, such as a misnamed directory or the like.

Lastly, there's also the fact that Nutch is a known quantity, and we've already put non-trivial effort into using and adapting it to our needs. It would be difficult to start all over again with another toolset, or assemblage of tools. We also have scaling expectations based on what we've achieved so far with Nutch(WAX). It would be painful to invest the time and effort in, say, Solr only to discover it can't scale to the same size on the same hardware.

Right now, the most interesting other project for us to consider is Solr. There seems to be more and more momentum behind it, and it does have some neat features, such as the "did you mean?" suggestions. However, the distributed search functionality is pretty rudimentary IMO, and I am concerned about reports that it doesn't scale beyond a few million or tens of millions of documents. Although it appears that some of this has to do with the modify/update capabilities, which are mitigated by the use of read-only IndexReaders (or something like that).

Aaron

--
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aa...@archive.org
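Aaron's "simple HTTP/JSON interface" for the search slaves is easy to prototype with nothing but the JDK's built-in HTTP server. A minimal sketch under that assumption -- the /search path and JSON shape are made up for illustration, and the actual query against the local index/segment shard is left as a comment:

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

// A search slave could expose a tiny HTTP/JSON endpoint like this, so the
// master and monitoring tools can ping it and sanity-check results.
public class SearchSlaveHttpStub {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/search", (HttpExchange ex) -> {
            // A real slave would run the query against its local shard here;
            // this stub just echoes the query with an empty hit list.
            String query = ex.getRequestURI().getQuery();   // e.g. "q=nutch"
            String json = "{\"query\":\"" + (query == null ? "" : query)
                    + "\",\"totalHits\":0,\"hits\":[]}";
            byte[] body = json.getBytes("UTF-8");
            ex.getResponseHeaders().set("Content-Type", "application/json");
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();   // try: curl "http://localhost:8080/search?q=nutch"
    }
}
```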
Re: The Future of Nutch, reactivated
Aaron Binns wrote:

> Our usage of Nutch is focused on index building and search services. We don't use the crawling/fetching features at all. We use Heritrix. Typically, our large-scale harvests are performed over 8-12 week periods, then the archived data is handed off to me for full-text search indexing. We deploy the indexes on a separate rack of machines dedicated to hosting the full-text search service.
>
> One of the biggest boons of Nutch is the Hadoop infrastructure. When indexing massive data sets, being able to fire up 60+ nodes in a Hadoop system helps tremendously.

Are you familiar with the distributed indexing package in Hadoop contrib/?

> However, one of the biggest challenges to using Nutch is the fact that the URL is used as the unique key for a document. This is usually a sensible thing to do, but for web archives, it doesn't work. Our NutchWAX package contains all sorts of hacks to work around this assumption.

Indeed, this change is something that I've been considering, too -- URL==page doesn't work that well in the case of archives, but also when your unit of information is smaller (pagelet) or larger (compound docs) than a page.

People can help with this by working on a patch that replaces this silent assumption with an explicit API, i.e. splitting recordId and URL into separate fields.

> As for the future of Nutch, I am concerned over what I see to be an increasing focus on crawling and fetching. We have only lightly evaluated other open source search projects, such as Solr, and are not convinced any can be a drop-in replacement for Nutch. It looks like Solr has some nice features for certain; I'm just not convinced it can scale up to the billion-document level.

What do you see as the unique strength of Nutch, then? IMHO there are existing frameworks for distributed indexing (on Hadoop) and distributed search (e.g. Katta). We would like to avoid duplication of effort, and to focus instead on the aspects of Nutch functionality that are not available elsewhere.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
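To illustrate the suggested recordId/URL split: a key that carries both fields explicitly might look like the following Hadoop Writable. This is only a sketch -- RecordKey and its fields are illustrative, not an existing Nutch or NutchWAX class; a web archive might use a capture timestamp plus content digest as the recordId:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// The URL identifies the resource; recordId identifies one particular
// capture of it (e.g. "20090519132337/sha1:ABC..."), so the same URL can
// appear many times in an archive without colliding.
public class RecordKey implements WritableComparable<RecordKey> {
    private Text url = new Text();
    private Text recordId = new Text();

    public RecordKey() {}

    public RecordKey(String url, String recordId) {
        this.url.set(url);
        this.recordId.set(recordId);
    }

    public void write(DataOutput out) throws IOException {
        url.write(out);
        recordId.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        url.readFields(in);
        recordId.readFields(in);
    }

    public int compareTo(RecordKey other) {
        int c = url.compareTo(other.url);
        return c != 0 ? c : recordId.compareTo(other.recordId);
    }

    public Text getUrl() { return url; }
    public Text getRecordId() { return recordId; }
}
```

A real key would also override hashCode() and equals() so that partitioning and grouping behave sensibly.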
Re: The Future of Nutch, reactivated
Andrzej Bialecki writes:

> Target audience
> ===
> I think that the Nutch project is experiencing an identity crisis now -- we are not sure what the target audience is, and we cannot satisfy everyone. I think there are the following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: actually, there are only a few such users, because it takes considerable resources to manage operations on that scale. Scalability, manageability and ranking/spam prevention are the chief concerns here.

We here at the Internet Archive are one of these users; our numbers are small, although the size of our data is big. We routinely deal with collections of documents (primarily web pages) in excess of 500 million.

We have developed a set of add-ons and modifications to Nutch called NutchWAX (Web Archive eXtensions). We use NutchWAX both for our internal projects (such as archive-it.org) and with our national library partners. In the coming years, more and more national libraries will be building their own web archives, mainly by performing "domain harvests" of websites in a country's domain. So I expect the list of users operating at this scale to grow to a few dozen in the next few years.

Our usage of Nutch is focused on index building and search services. We don't use the crawling/fetching features at all; we use Heritrix. Typically, our large-scale harvests are performed over 8-12 week periods, then the archived data is handed off to me for full-text search indexing. We deploy the indexes on a separate rack of machines dedicated to hosting the full-text search service.

One of the biggest boons of Nutch is the Hadoop infrastructure. When indexing massive data sets, being able to fire up 60+ nodes in a Hadoop system helps tremendously.

However, one of the biggest challenges to using Nutch is the fact that the URL is used as the unique key for a document. This is usually a sensible thing to do, but for web archives, it doesn't work. Our NutchWAX package contains all sorts of hacks to work around this assumption.

As for the future of Nutch, I am concerned over what I see to be an increasing focus on crawling and fetching. We have only lightly evaluated other open source search projects, such as Solr, and are not convinced any can be a drop-in replacement for Nutch. It looks like Solr has some nice features for certain; I'm just not convinced it can scale up to the billion-document level.

Aaron

--
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aa...@archive.org
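For readers who haven't seen the "fire up 60+ Hadoop nodes" workflow: index building in this style is just a MapReduce job over the archived content, with each reducer producing one shard. A deliberately simplified skeleton -- the tab-separated input format and the indexing step are assumptions for illustration, not the Hadoop contrib indexer or NutchWAX code:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexBuildSketch {

    // Map: turn one archived record into (url, extracted text).
    public static class ParseMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            // Assume one "url \t rawContent" record per line, for illustration only.
            String[] parts = record.toString().split("\t", 2);
            if (parts.length == 2) {
                ctx.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }

    // Reduce: each reducer would build one index shard from its slice of documents.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text url, Iterable<Text> contents, Context ctx)
                throws IOException, InterruptedException {
            for (Text content : contents) {
                // A real job would add a Lucene/Solr document here instead.
                ctx.write(url, content);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "index-build-sketch");
        job.setJarByClass(IndexBuildSketch.class);
        job.setMapperClass(ParseMapper.class);
        job.setReducerClass(IndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```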
The Future of Nutch, reactivated
All,

Sorry that I didn't reply, and thus this isn't threaded properly. I've been lurking on the list via the RSS feed; I subscribed so I could put in my two cents' worth.

I've recently started using git to maintain a local branch of Nutch. My hope is to get my employer to let me contribute "just engineering" back to Nutch. We'd like to customize Nutch in various ways and use that as the basis of internal R&D and potentially some products that we'd not contribute. The other things, which just make Nutch more flexible, I'd like to contribute. I've been working with Nutch on and off since sometime in November or so for my job.

A couple of thoughts:

1. Nutch is too monolithic.
2. Nutch does the heavy lifting of a framework for a distributed system well.
3. Nutch doesn't really keep all the various pieces up to date very well.
4. Nutch requires at least a Bachelor's in Nutch to deal with it.
5. Documentation in the wiki is out of date, or it's hard to tell which versions various things work with.
6. Nutch isn't very friendly to simple requests if a complex hack can be found instead (see recursive file:// handling).

My most recent task was actually to update to Tika 0.3 and then use Tika's parsing of the docx format for indexing. There were several interesting problems, but I want to get permission from my employer and just show the patches.

I think we fall into category #2 (we wish we could fall into category #1, but such is life). We want to make our intranet searchable on a large scale, and would like to apply the indexing and retrieval in a number of R&D projects. We also have an interest in using Nutch/Lucene/Hadoop in a number of other problems unrelated to Internet search.

A couple of things that I'd like to help do (or see done) that would make Nutch far more framework-like, so I can assemble the pieces and parts into what I need:

1. Get Nutch and its various components into a public Maven repository, and have public scripts to do the publishing. I don't care if that is via Ant with Ivy extensions, or switching to a Maven build system. I've actually started with both approaches. I'm much better with Maven, but I think Ivy is more likely to be acceptable to the project. I'd like to see this done with Hadoop and any other core components. For now, I'm just maintaining a local POM file that pushes my builds into our local Maven repository. I'm going to do this one way or another, and would love to hear any feedback on an approach that is acceptable to be contributed back to Nutch.

2. Clearly segregate "Plugins" from "Core" from "Bits that make it an Application". I've had fun problems with ClassLoaders, and it seems that the interface plugins are allowed to access is "anything in Core, or its existing libraries". It would seem better to have a Core Runtime which plugins can depend upon and which is relatively minimal. Identify the pieces of Nutch that are there to make it into a program you can run, and push those into a separate place. For APIs with multiple implementations, it would be nice not to be forced to use the same one the Core does when writing a plugin.

3. As you stated earlier, use OSGi for a plugin system and some type of dependency injection rather than hand-parsed XML files. I've had problems with the PluginClassloader (I wanted to use Tika in my plugin, and because of the plugin/classloader setup, I had to push the POI libraries into the lib directory rather than into the src/plugin/plugin-XXX/lib directory). Well, that was the first approach; the second was to hack the PluginClassloader not to delegate to the parent for the "org.apache.tika" package and then provide Tika in the plugin, and it all worked. Using a well-known plug-in system would have made this much easier. (A sketch of that classloader trick follows below.)

4. Help transition to using third-party libraries. Nutch still has an SWF parser that went unmaintained in 2002. Flash has moved a long way; it would seem sensible to either jettison that code or update to newer versions of the same library by the same project (SWF2). Not that I care about Flash, but it seems that parsing isn't something Nutch proper is focused on.

5. With whatever build system is chosen, figure out how to set up a Maven build to construct "out-of-tree" Nutch plugins without having to manually deal with all of the various dependencies and packaging details.

6. Better support for running out of an IDE. The instructions work, and are very helpful. It'd be much nicer to see tools or scripts used to generate a saner setup than is currently there (having each plugin be a project in Eclipse would be a huge help for debugging weird classpath issues). Right now, running and compiling inside of Eclipse isn't at all similar to running it outside if you have any kind of classloader issues or multiple conflicting libraries. Not that there are any in-tree right now, but I can see how future ones could exist.

7. Make each plugin its own deliverable
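The classloader hack mentioned in item 3 -- making the plugin's own copy of a package win over the parent's -- looks roughly like this. A sketch only; ChildFirstClassLoader is an illustrative name, not Nutch's actual PluginClassLoader:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Loads classes whose package matches one of the given prefixes from the
// plugin's own jars first, and only then falls back to the parent loader.
public class ChildFirstClassLoader extends URLClassLoader {
    private final String[] childFirstPrefixes;

    public ChildFirstClassLoader(URL[] pluginJars, ClassLoader parent,
                                 String... childFirstPrefixes) {
        super(pluginJars, parent);
        this.childFirstPrefixes = childFirstPrefixes;
    }

    @Override
    protected synchronized Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        Class<?> c = findLoadedClass(name);
        if (c == null && isChildFirst(name)) {
            try {
                c = findClass(name);   // look in the plugin's own jars first
            } catch (ClassNotFoundException e) {
                // not bundled with the plugin -- fall through to parent delegation
            }
        }
        if (c == null) {
            return super.loadClass(name, resolve);   // normal parent-first path
        }
        if (resolve) {
            resolveClass(c);
        }
        return c;
    }

    private boolean isChildFirst(String className) {
        for (String prefix : childFirstPrefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Constructed as, say, new ChildFirstClassLoader(pluginJarUrls, coreClassLoader, "org.apache.tika.") -- both arguments being placeholders -- it would resolve Tika from the jars bundled with the plugin instead of whatever the core happens to ship.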
Re: The Future of Nutch, reactivated
Hi Andrzej,

Great summary. My general feeling on this is similar to my prior comments on similar threads from Otis and from Dennis. My personal pet projects for Nutch2:

* refactored Nutch core data structures, modeled as POJOs
* refactored Nutch architecture where crawling/indexing/parsing/scoring/etc. are insulated from the underlying messaging substrate (e.g., crawl over JMS, EJB, Hadoop, RMI, etc.; crawl using Heritrix; parse using Tika or some other framework; etc.)
* simpler Nutch deployment mechanisms (separate the Nutch deployment package from the source code package); think about using Maven2

+1 to all of those, and to other ideas for how to improve the project's focus. (A small sketch of the "insulate from the substrate" idea appears after the quoted message below.)

Cheers,
Chris

On 5/14/09 6:45 AM, "Andrzej Bialecki" wrote:

> Hi all,
>
> I'd like to revive this thread and gather additional feedback so that we end up with concrete conclusions. Much of what I write below others have said before; I'm trying to express it here as it looks from my point of view.
>
> Target audience
> ===
> I think that the Nutch project is experiencing an identity crisis now -- we are not sure what the target audience is, and we cannot satisfy everyone. I think there are the following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: actually, there are only a few such users, because it takes considerable resources to manage operations on that scale. Scalability, manageability and ranking/spam prevention are the chief concerns here.
>
> 2. Medium-scale vertical search: I suspect that many Nutch users fall into this category. Modularity, flexibility in implementing custom processing, and the ability to modify workflows and to use only some Nutch components seem to be the chief concerns here. Scalability too, but only up to a volume of ~100-200 mln documents.
>
> 3. Small- to medium-scale enterprise search: there's a sizeable number of Nutch users that fall into this category, for historical reasons. Link-based ranking and resource discovery are not that important here, but integration with Windows networking, Microsoft formats and databases, as well as realtime indexing and easy index maintenance, are crucial. This class of users often has to heavily customize Nutch to get any sensible result. Also, this is where Solr really shines, so there is little benefit in using Nutch here. I predict that Nutch will have fewer and fewer users of this type.
>
> 4. Single desktop to small intranet search: as above, but the emphasis is on ease of use out of the box, and an often requested feature is a GUI frontend. Currently, IMHO, Nutch is too complex and requires too much command-line operation for casual users to make this use case attractive.
>
> What is the target audience that we as a community want to support? By this I mean not only moral support, but also active participation in the development process. From where we are at the moment we could go in any of the above directions.
>
> Core competence
> ===
> This is a simple but important point. Currently we maintain several major subsystems in Nutch that are implemented by other projects, and often in a better way. The plugin framework (and dependency injection) and content parsing are two areas that we have to delegate to third-party libraries, such as Tika and OSGi or some other simple IoC container -- probably there are other components that we don't have to do ourselves. Another thing that I'd love to delegate is distributed search and index maintenance -- either through Solr or Katta or something else.
>
> The question then is, what is the core competence of this project? I see the following major areas that are unique to Nutch:
>
> * crawling - this includes crawl scheduling (and re-crawl scheduling), discovery and classification of new resources, strategies for crawling specific sets of URLs (hosts and domains) under bandwidth and netiquette constraints, etc.
>
> * web graph analysis - this includes link-based ranking, mirror detection (and URL "aliasing"), but also link spam detection and more sophisticated control over the crawl frontier.
>
> Anything more? I'm not sure - perhaps I would add template detection and pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).
>
> Nutch 1.0 already made some steps in this direction, with the new link analysis package and pluggable FetchSchedule and Signature. A lot remains to be done here, and we are still spending a lot of resources on dealing with issues outside this core competence.
>
> ---
>
> So, what do we need to do next?
>
> * we need to decide where we should commit our resources, as a community of users, contributors and committers, so that the project is most useful to our target audience. At this point there are few active committers, so I don't think we can cover more than one direction at a time ... ;)
>
> * we need to re-architect Nutch to focus on our core competence, and delegate what we can to other projects.
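Chris's second bullet -- insulating crawling/indexing/parsing from the messaging substrate -- largely comes down to expressing the core model as POJOs and small interfaces, with Hadoop, JMS, Tika, Solr and friends kept behind adapters. A minimal sketch; Page, Parser, Indexer and IndexingPipeline are illustrative names, not a proposed Nutch2 API:

```java
import java.util.Collections;
import java.util.List;

// Plain-Java data structure for one crawled/captured page; no Hadoop
// Writables or Lucene Documents in the core model.
class Page {
    String url;
    String recordId;   // separate from the URL, per the earlier discussion
    String parsedText;
    List<String> outlinks = Collections.emptyList();
}

// Parsing hidden behind an interface, so Tika (or anything else) is just
// one pluggable implementation.
interface Parser {
    Page parse(String url, byte[] rawContent);
}

// Indexing likewise: one implementation might talk to Solr, another write
// Lucene shards inside a MapReduce job.
interface Indexer {
    void add(Page page);
}

// Core logic written only against the interfaces; the substrate (Hadoop,
// JMS, a plain loop in a test) decides how and where this gets invoked.
class IndexingPipeline {
    private final Parser parser;
    private final Indexer indexer;

    IndexingPipeline(Parser parser, Indexer indexer) {
        this.parser = parser;
        this.indexer = indexer;
    }

    void process(String url, byte[] rawContent) {
        indexer.add(parser.parse(url, rawContent));
    }
}
```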
The Future of Nutch, reactivated
Hi all,

I'd like to revive this thread and gather additional feedback so that we end up with concrete conclusions. Much of what I write below others have said before; I'm trying to express it here as it looks from my point of view.

Target audience
===
I think that the Nutch project is experiencing an identity crisis now -- we are not sure what the target audience is, and we cannot satisfy everyone. I think there are the following groups of Nutch users:

1. Large-scale Internet crawl & search: actually, there are only a few such users, because it takes considerable resources to manage operations on that scale. Scalability, manageability and ranking/spam prevention are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall into this category. Modularity, flexibility in implementing custom processing, and the ability to modify workflows and to use only some Nutch components seem to be the chief concerns here. Scalability too, but only up to a volume of ~100-200 mln documents.

3. Small- to medium-scale enterprise search: there's a sizeable number of Nutch users that fall into this category, for historical reasons. Link-based ranking and resource discovery are not that important here, but integration with Windows networking, Microsoft formats and databases, as well as realtime indexing and easy index maintenance, are crucial. This class of users often has to heavily customize Nutch to get any sensible result. Also, this is where Solr really shines, so there is little benefit in using Nutch here. I predict that Nutch will have fewer and fewer users of this type.

4. Single desktop to small intranet search: as above, but the emphasis is on ease of use out of the box, and an often requested feature is a GUI frontend. Currently, IMHO, Nutch is too complex and requires too much command-line operation for casual users to make this use case attractive.

What is the target audience that we as a community want to support? By this I mean not only moral support, but also active participation in the development process. From where we are at the moment we could go in any of the above directions.

Core competence
===
This is a simple but important point. Currently we maintain several major subsystems in Nutch that are implemented by other projects, and often in a better way. The plugin framework (and dependency injection) and content parsing are two areas that we have to delegate to third-party libraries, such as Tika and OSGi or some other simple IoC container -- probably there are other components that we don't have to do ourselves. Another thing that I'd love to delegate is distributed search and index maintenance -- either through Solr or Katta or something else.

The question then is, what is the core competence of this project? I see the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling), discovery and classification of new resources, strategies for crawling specific sets of URLs (hosts and domains) under bandwidth and netiquette constraints, etc.

* web graph analysis - this includes link-based ranking, mirror detection (and URL "aliasing"), but also link spam detection and more sophisticated control over the crawl frontier.

Anything more? I'm not sure - perhaps I would add template detection and pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).

Nutch 1.0 already made some steps in this direction, with the new link analysis package and pluggable FetchSchedule and Signature. A lot remains to be done here, and we are still spending a lot of resources on dealing with issues outside this core competence.

---

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community of users, contributors and committers, so that the project is most useful to our target audience. At this point there are few active committers, so I don't think we can cover more than one direction at a time ... ;)

* we need to re-architect Nutch to focus on our core competence, and delegate what we can to other projects.

Feel free to comment on the above, and make suggestions or corrections. I'd like to wrap it up in a concise mission statement that would help us set the goals for the next couple of months.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
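As a footnote on the pluggable FetchSchedule point: a re-crawl policy is essentially a strategy that adjusts each URL's fetch interval based on whether its signature changed at the last fetch. A self-contained sketch of the adaptive idea -- this intentionally does not reproduce Nutch's actual FetchSchedule interface, and the rates and bounds are illustrative:

```java
import java.util.concurrent.TimeUnit;

// Adaptive re-fetch scheduling: shorten the interval for pages that change
// when we fetch them, lengthen it for pages that don't, within bounds.
public class AdaptiveRefetchSketch {
    private static final long MIN_INTERVAL = TimeUnit.HOURS.toSeconds(1);
    private static final long MAX_INTERVAL = TimeUnit.DAYS.toSeconds(90);
    private static final double INC_RATE = 0.4;   // grow by 40% when unchanged
    private static final double DEC_RATE = 0.2;   // shrink by 20% when changed

    /**
     * @param currentInterval current re-fetch interval in seconds
     * @param pageChanged     whether the latest fetch produced a new signature
     * @return the next re-fetch interval in seconds, clamped to the bounds
     */
    public static long nextInterval(long currentInterval, boolean pageChanged) {
        double next = pageChanged
                ? currentInterval * (1.0 - DEC_RATE)
                : currentInterval * (1.0 + INC_RATE);
        return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, Math.round(next)));
    }

    public static void main(String[] args) {
        long interval = TimeUnit.DAYS.toSeconds(7);
        interval = nextInterval(interval, false);  // unchanged -> back off
        interval = nextInterval(interval, true);   // changed   -> fetch sooner
        System.out.println("next interval (s): " + interval);
    }
}
```

Nutch's pluggable FetchSchedule is meant to let policies along these lines (or per-host, per-template variants) be swapped in without touching the rest of the crawler.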