Re: The Future of Nutch, reactivated

2009-05-23 Thread Otis Gospodnetic

Hello,
(I saw that the first copy of this email went to nutch-user, but I assume the 
nutch-dev copy was a resend and the right list to follow up on.)

I agree with the list of core competencies.  For example -- I don't remember 
where I said or wrote this, but I know I've said it a few times before -- I 
think Solr is the future of Nutch's search.  I have a feeling the original 
Nutch search components will die off with time: nobody is working on them, and 
Solr is making great progress.

In my experience, most Nutch users fall under #2.  Most require web-wide 
crawling, but really care about a specific vertical slice.  So that's where I'd 
say the focus should be, theoretically.  I say theoretically because I don't 
think active Nutch developers can really choose a direction if it doesn't match 
their own itches.


Otis 
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
> From: Andrzej Bialecki 
> To: nutch-dev@lucene.apache.org
> Sent: Thursday, May 14, 2009 9:59:11 AM
> Subject: The Future of Nutch, reactivated

Re: The Future of Nutch, reactivated

2009-05-19 Thread Bradford Stephens
I would like to point out that Nutch is going to be essential to our
company's infrastructure -- we're definitely case #1. We'll probably have it
running on 100 boxes in a few weeks.

Re: The Future of Nutch, reactivated

2009-05-19 Thread Aaron Binns

Andrzej Bialecki  writes:

>> One of the biggest boons of Nutch is the Hadoop infrastructure.  When
>> indexing massive data sets, being able to fire up 60+ nodes in a
>> Hadoop system helps tremendously.
>
> Are you familiar with the distributed indexing package in Hadoop
> contrib/ ?

Only superficially at most.  Last I looked at it, it seemed to be a
"hello world" prototype.  If it's developed more, it might be worth
another look.

>> However, one of the biggest challenges to using Nutch is the fact
>> that the URL is used as the unique key for a document.
>
> Indeed, this change is something that I've been considering, too - 
> URL==page doesn't work that well in case of archives, but also when
> your unit of information is smaller (pagelet) or larger (compound
> docs) than a page.
>
> People can help with this by working on a patch that replaces this
> silent assumption with an explicit API, i.e. splitting recordId and
> URL into separate fields.

Patches are always welcome; it is an open source package, after all. :)  I'll
see about creating a patch-set for the changes I've made in NutchWAX.

>> As for the future of Nutch, I am concerned over what I see to be an
>> increasing focus on crawling and fetching.  We have only lightly
>> evaluated other Open Source search projects, such as Solr, and are not
>> convinced any can be a drop-in replacement for Nutch.  It looks like
>> Solr has some nice features, for certain; I'm just not convinced it can
>> scale up to the billion document level.
>
> What do you see as the unique strength of Nutch, then? IMHO there are
> existing frameworks for distributed indexing (on Hadoop) and
> distributed search (e.g. Katta). We would like to avoid the
> duplication of effort, and to focus instead on the aspects of Nutch
> functionality that are not available elsewhere.

Right now, the unique strength of Nutch -- to my organization -- is that
it has all the requisite pieces and comes closer to a complete solution
than other open source projects.  What features it lacks compared to
others are less important than the ones it has that others do not.

Two key features of Nutch indexing are the content parsing and the link
extraction.  The parsing plugins seem to work well enough, although
easier modification of content tokenizing and stop-list management would
be nice.  For example, using a config file to tweak the tokenizing for,
say, French or Spanish would be nicer than having to write a new .jj file
and a custom build.
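
For illustration only -- this is not existing Nutch behavior -- a
config-driven choice of analyzer could be as small as the sketch below.
It assumes Lucene's contrib analyzers are on the classpath, and a
hypothetical configuration property supplies the language code:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.fr.FrenchAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    /** Hypothetical sketch: map a configured language code to an
     *  analyzer instead of writing a new .jj grammar and rebuilding. */
    public class AnalyzerFactory {
      public static Analyzer forLanguage(String lang) {
        if ("fr".equals(lang)) return new FrenchAnalyzer();
        // further languages would be wired in from configuration here
        return new StandardAnalyzer();  // fallback
      }
    }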

Along the same lines, language-awareness would have to be included in
the query processing as well.  And speaking of which, the way in which
Nutch query processing is optimized for web search makes sense.  I've
read that Solr can be configured to emulate the Nutch query processing.
If so, it would eliminate a competitive advantage of Nutch.

Nutch's summary/snippet generation approach works fine.  It's not clear
to me how this is done with the other tools.

On the search service side of things, Nutch is adequate, but I would
like to investigate other distributed search systems.  My main complaint
about Nutch's implementation is the use of the Hadoop RPC mechanism.
It's very difficult to diagnose and debug problems.  I'd prefer if the
master just talked to the slaves over OpenSearch or a simple HTTP/JSON
interface.  This way, monitoring tools could easily ping the slaves and
check for sensible results.
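
To make the idea concrete, here is a minimal sketch -- not existing Nutch
code -- of the kind of status endpoint a slave could expose, using the
com.sun.net.httpserver classes built into Java 6; the port and JSON shape
are made up:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import com.sun.net.httpserver.HttpExchange;
    import com.sun.net.httpserver.HttpHandler;
    import com.sun.net.httpserver.HttpServer;

    /** Hypothetical sketch: a search slave answers GET /status with a
     *  small JSON document that monitoring tools can poll. */
    public class SlaveStatus {
      public static void main(String[] args) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/status", new HttpHandler() {
          public void handle(HttpExchange ex) throws IOException {
            byte[] body = "{\"status\":\"ok\",\"shards\":4}".getBytes("UTF-8");
            ex.getResponseHeaders().set("Content-Type", "application/json");
            ex.sendResponseHeaders(200, body.length);
            OutputStream out = ex.getResponseBody();
            out.write(body);
            out.close();
          }
        });
        server.start();
      }
    }

A monitoring tool could then simply fetch /status on each slave and check
the reply.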

Along the same diagnosis/debug lines, I've added more log messages to
the start-up code of the search slave.  Without these, it's very
difficult to diagnose some trivial mistake in the deployment of the
index/segment shards, such as a mis-named directory or the like.

Lastly, there's also the fact that Nutch is a known quantity and we've
already put non-trivial effort into using and adapting it to our needs.
It would be difficult to start all over again with another toolset, or
assemblage of tools.  We also have scaling expectations based on what
we've achieved so far with Nutch(WAX).  It would be painful to invest
the time and effort in, say, Solr, only to discover it can't scale to the
same size with the same hardware.


Right now, the most interesting other project for us to consider is
Solr.  There seems to be more and more momentum behind it and it does
have some neat features, such as the "did you mean?" suggestions and
things.  However, the distributed search functionality is pretty
rudimentary IMO and I am concerned about reports that it doesn't scale
beyond a few million or tens of millions of documents, although it
appears that some of this has to do with the modify/update capabilities,
which can be mitigated by the use of read-only IndexReaders (or something
like that).


Aaron

-- 
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aa...@archive.org


Re: The Future of Nutch, reactivated

2009-05-19 Thread Andrzej Bialecki

Aaron Binns wrote:


> Our usage of Nutch is focused on index building and search services.  We
> don't use the crawling/fetching features at all.  We use Heritrix.
> Typically, our large-scale harvests are performed over 8-12 week
> periods, then the archived data is handed off to me for full-text search
> indexing.  We deploy the indexes on a separate rack of machines
> dedicated to hosting the full-text search service.
>
> One of the biggest boons of Nutch is the Hadoop infrastructure.  When
> indexing massive data sets, being able to fire up 60+ nodes in a Hadoop
> system helps tremendously.


Are you familiar with the distributed indexing package in Hadoop contrib/ ?



> However, one of the biggest challenges to using Nutch is the fact
> that the URL is used as the unique key for a document.  This is usually
> a sensible thing to do, but for web archives, it doesn't work.  Our
> NutchWAX package contains all sorts of hacks to work around this
> assumption.


Indeed, this change is something that I've been considering, too - 
URL==page doesn't work that well in case of archives, but also when your 
unit of information is smaller (pagelet) or larger (compound docs) than 
a page.


People can help with this by working on a patch that replaces this 
silent assumption with an explicit API, i.e. splitting recordId and URL 
into separate fields.
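
A minimal sketch of what such an explicit API might look like (the names
here are hypothetical, not committed code):

    /** Hypothetical record identity, decoupled from the URL: for a web
     *  archive the id could be URL plus capture timestamp; for a pagelet,
     *  URL plus a fragment path; for a compound document, the container
     *  URL plus an entry name. */
    public interface RecordIdentity {
      String getRecordId();  // unique key used by the crawl db and indexer
      String getUrl();       // where the content was actually fetched from
    }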





> As for the future of Nutch, I am concerned over what I see to be an
> increasing focus on crawling and fetching.  We have only lightly
> evaluated other Open Source search projects, such as Solr, and are not
> convinced any can be a drop-in replacement for Nutch.  It looks like
> Solr has some nice features, for certain; I'm just not convinced it can
> scale up to the billion document level.


What do you see as the unique strength of Nutch, then? IMHO there are 
existing frameworks for distributed indexing (on Hadoop) and distributed 
search (e.g. Katta). We would like to avoid the duplication of effort, 
and to focus instead on the aspects of Nutch functionality that are not 
available elsewhere.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: The Future of Nutch, reactivated

2009-05-18 Thread Aaron Binns

Andrzej Bialecki  writes:

> Target audience
> ===
> I think that the Nutch project is experiencing an identity crisis now --
> we are not sure what our target audience is, and we cannot satisfy
> everyone. I think that there are the following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: actually, there are only a few
> such users, because it takes considerable resources to manage operations
> on that scale. Scalability, manageability and ranking/spam prevention
> are the chief concerns here.

We here at the Internet Archive are one of these users, and our numbers
are small, although the size of our data is big.  We routinely deal with
collections of documents (primarily web pages) in excess of 500 million.

We have developed a set of add-ons and modifications to Nutch called
NutchWAX (Web Archive eXtensions).  We use NutchWAX both for our
internal projects (such as archive-it.org) and with our national
library partners.

In the coming years, more and more national libraries will be building
their own web archives, mainly by performing "domain harvests" of
websites in a country's domain.  So, I expect the number of users
operating at this scale to grow to a few dozen in the next few
years.

Our usage of Nutch is focused on index building and search services.  We
don't use the crawling/fetching features at all.  We use Heritrix.
Typically, our large-scale harvests are performed over 8-12 week
periods, then the archived data is handed off to me for full-text search
indexing.  We deploy the indexes on a separate rack of machines
dedicated to hosting the full-text search service.

One of the biggest boons of Nutch is the Hadoop infrastructure.  When
indexing massive data sets, being able to fire up 60+ nodes in a Hadoop
system helps tremendously.

However, one of the biggest challenges to using Nutch is the fact
that the URL is used as the unique key for a document.  This is usually
a sensible thing to do, but for web archives, it doesn't work.  Our
NutchWAX package contains all sorts of hacks to work around this
assumption.


As for the future of Nutch, I am concerned over what I see to be an
increasing focus on crawling and fetching.  We have only lightly
evaluated other Open Source search projects, such as Solr, and are not
convinced any can be a drop-in replacement for Nutch.  It looks like
Solr has some nice features, for certain; I'm just not convinced it can
scale up to the billion document level.


Aaron

-- 
Aaron Binns
Senior Software Engineer, Web Group
Internet Archive
aa...@archive.org


The Future of Nutch, reactivated

2009-05-14 Thread Kirby Bohling
All,

Sorry that I didn't reply, and thus this isn't threaded properly.
I've lurked on the list via the RSS feed; I subscribed so I could put
in my two cents' worth.  I've recently started using git to maintain a
local branch of Nutch.  My hope is to get my employer to let me
contribute "just engineering" back to Nutch.  We'd like to customize
Nutch in various ways and use that as the basis of internal R&D and
potentially some products that we would not contribute.  The other
things, the ones that just make Nutch more flexible, I'd like to contribute.

I've been working with Nutch on and off since sometime in November or
so for my job.  A couple of thoughts:

1. Nutch is too monolithic.
2. Nutch does the heavy lifting of a framework for a distributed system well.
3. Nutch doesn't really keep all of its various pieces up to date very well.
4. Nutch requires at least a Bachelor's in Nutch to deal with it.
5. Documentation in the wiki is out of date, or it is hard to tell which
versions various things apply to.
6. Nutch isn't very friendly to simple requests if a complex hack can be
found instead (see recursive file:// handling).

My most recent task was actually to update Nutch to use Tika 0.3 and
then use Tika's parsing of the docx format for indexing.  There were
several interesting problems, but I want to get permission from my
employer first and then just show the patches.

I think we fall into the category of #2 (we wish we could fall into
category #1, but such is life).  We want to make our intranet
searchable on a large scale, and would like to apply the indexing and
retrieval in a number of R&D projects.  We also have an interest in
using Nutch/Lucene/Hadoop on a number of other problems unrelated to
Internet search.

A couple of things that I'd like to help do (or see done) that would
make Nutch far more framework like so I can assemble the pieces and
parts into what I need:

1. Get Nutch and its various components into a public Maven
repository, and have public scripts to do the publishing.  I don't care
if that is via Ant with Ivy extensions, or by switching to a Maven build
system.  I've actually started with both approaches.  I'm much better
with Maven, but I think Ivy is more likely to be acceptable to the
project.  I'd like to see this done with Hadoop, and any other core
components.  For now, I'm just maintaining a local POM file that
pushes my builds into our local Maven repository.  I'm going to do
this one way or another, and would love to hear any feedback on an
approach that is acceptable to be contributed back to Nutch.
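
To illustrate the goal, a downstream project could then depend on Nutch
like any other library; the coordinates below are hypothetical, since no
such artifacts are published today:

    <!-- hypothetical coordinates, for illustration only -->
    <dependency>
      <groupId>org.apache.nutch</groupId>
      <artifactId>nutch-core</artifactId>
      <version>1.0</version>
    </dependency>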

2. Clearly segregate "Plugins" from "Core" from "Bits that make it an
Application".  I've had fun problems with ClassLoaders, and it seems
that the interface plugins are allowed to access is "anything in
Core, or its existing libraries".  It would be better to have a Core
Runtime, which plugins can depend upon, and which is relatively
minimal.  Identify the pieces of Nutch which are there to make it into
a program you can run, and push those into a separate place.  For APIs
with multiple implementations, it would be nice not to be forced to
use the same one the Core does when a plugin is written.

3. As you stated earlier, use OSGi for a plugin system and some type
of dependency injection rather than hand-parsed XML files.  I've had
problems with the PluginClassloader (I wanted to use Tika in my
plugin, and because of the plugin/classloader setup, I had to push the
POI libraries into the lib directory rather than into the
src/plugin/plugin-XXX/lib directory).  Well, that was the first
approach; the second was to hack the PluginClassloader not to delegate
to the parent for the "org.apache.tika" package and then provide Tika
in the plugin, and it all worked.  Using a well-known plug-in system
would have made this much easier.
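
As a rough sketch of how that could look -- the Parser interface below is
a stand-in, not the real Nutch API -- an OSGi bundle carries its own
Tika/POI jars on its private classpath and simply registers its parser as
a service:

    import org.osgi.framework.BundleActivator;
    import org.osgi.framework.BundleContext;

    public class ParsePluginActivator implements BundleActivator {
      /** Stand-in for the real parser contract. */
      public interface Parser { String parse(byte[] content); }

      public void start(BundleContext ctx) throws Exception {
        // Libraries embedded in this bundle are invisible to other
        // bundles, so no classloader delegation hacks are needed.
        ctx.registerService(Parser.class.getName(), new Parser() {
          public String parse(byte[] content) {
            return "parsed " + content.length + " bytes";  // trivial stub
          }
        }, null);
      }

      public void stop(BundleContext ctx) throws Exception {
        // services registered by this bundle are unregistered automatically
      }
    }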

4. Help the transition to using 3rd-party libraries; Nutch still has
an SWF parser that went unmaintained in 2002.  Flash has moved a long
way, so it would seem sensible to either jettison that code or update
to newer versions of the same library by the same project (SWF2).  Not
that I care about Flash, but it seems that parsing isn't something
Nutch proper is focused on.

5. With whatever build system is chosen, figure out how to set up a
Maven build to construct "out-of-tree" Nutch plugins without having to
manually deal with all of the various dependencies and packaging
details.

6. Better support for running out of an IDE.  The instructions work,
and are very helpful.  It'd be much nicer to see the use of tools or
scripts to generate a saner setup than is currently there (having
each plugin be a project in Eclipse would be a huge help for debugging
weird classpath issues).  Right now, running and compiling inside of
Eclipse isn't at all similar to running it outside if you have any
kind of classloader issues, or multiple conflicting libraries.  Not
that there are any in-tree right now, but I can see how future ones
could exist.

7. Make each plugin its own deliverable.

Re: The Future of Nutch, reactivated

2009-05-14 Thread Mattmann, Chris A
Hi Andrzej,

Great summary. My general feeling on this is similar to my prior comments on
similar threads from Otis and from Dennis. My personal pet projects for
Nutch2:

* refactored Nutch core data structures, modeled as POJOs
* refactored Nutch architecture where crawling/indexing/parsing/scoring/etc.
are insulated from the underlying messaging substrate (e.g., crawl over JMS,
EJB, Hadoop, RMI, etc., crawl using Heritrix, parse using Tika or some other
framework, etc.)
* simpler Nutch deployment mechanisms (separate Nutch deployment package
from source code package), think about using Maven2

+1 to all of those and other ideas for how to improve the project's focus.
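
As a hedged illustration of the first bullet (all names hypothetical): a
core data structure as a plain POJO, with no Writable or other Hadoop
coupling, so any messaging substrate can carry it:

    /** Hypothetical POJO: serialized by whichever substrate is plugged in. */
    public class CrawlRecord implements java.io.Serializable {
      private String url;
      private long fetchTime;
      private byte[] content;

      public String getUrl() { return url; }
      public void setUrl(String url) { this.url = url; }
      public long getFetchTime() { return fetchTime; }
      public void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }
      public byte[] getContent() { return content; }
      public void setContent(byte[] content) { this.content = content; }
    }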

Cheers,
Chris


On 5/14/09 6:45 AM, "Andrzej Bialecki"  wrote:


The Future of Nutch, reactivated

2009-05-14 Thread Andrzej Bialecki

Hi all,

I'd like to revive this thread and gather additional feedback so that we
end up with concrete conclusions. Much of what I write below others have
said before, I'm trying here to express this as it looks from my point
of view.

Target audience
===
I think that the Nutch project is experiencing an identity crisis now --
we are not sure what our target audience is, and we cannot satisfy
everyone. I think that there are the following groups of Nutch users:

1. Large-scale Internet crawl & search: actually, there are only a few
such users, because it takes considerable resources to manage operations
on that scale. Scalability, manageability and ranking/spam prevention
are the chief concerns here.

2. Medium-scale vertical search: I suspect that many Nutch users fall
into this category. Modularity, flexibility in implementing custom
processing, ability to modify workflows and to use only some Nutch
components seem to be the chief concerns here. Scalability too, but only up
to a volume of ~100-200 million documents.

3. Small- to medium-scale enterprise search: there's a sizeable number
of Nutch users that fall into this category, for historical reasons.
Link-based ranking and resource discovery are not that important here,
but integration with Windows networking, Microsoft formats and databases,
as well as realtime indexing and easy index maintenance, are crucial.
This class of users often has to heavily customize Nutch to get any
sensible result. Also, this is where Solr really shines, so there is
little benefit in using Nutch here. I predict that Nutch will have fewer
and fewer users of this type.

4. Single desktop to small intranet search: as above, but the accent is
on ease of use out of the box, and an often requested feature is a
GUI frontend. Currently, IMHO, Nutch is too complex and requires too much
command-line operation for this use case to be attractive to casual users.

What is the target audience that we as a community want to support? By
this I mean not only the moral support, but also active participation in
the development process. From where we are at the moment, we could go in
any of the above directions.

Core competence
===
This is a simple but important point. Currently we maintain several
major subsystems in Nutch that are also implemented by other projects, and
often in a better way. The plugin framework (and dependency injection) and
content parsing are two areas that we should delegate to third-party
libraries, such as Tika and OSGi or some other simple IoC container --
there are probably other components that we don't have to do ourselves.
Another thing that I'd love to delegate is the distributed search and
index maintenance - either through Solr or Katta or something else.

The question then is, what is the core competence of this project? I see
the following major areas that are unique to Nutch:

* crawling - this includes crawl scheduling (and re-crawl scheduling),
discovery and classification of new resources, strategies for crawling
specific sets of URLs (hosts and domains) under bandwidth and netiquette
constraints, etc.

* web graph analysis - this includes link-based ranking, mirror
detection (and URL "aliasing") but also link spam detection and a more
complex control over the crawling frontier.

Anything more? I'm not sure - perhaps I would add template detection and
pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).

Nutch 1.0 already made some steps in this direction, with the new link
analysis package and pluggable FetchSchedule and Signature. A lot
remains to be done here, and we are still spending a lot of resources on
dealing with issues outside this core competence.
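
As one concrete illustration of that direction, the kind of policy a
pluggable FetchSchedule enables can be sketched in a few lines -- a toy
model with made-up constants, not the actual Nutch implementation:

    /** Toy adaptive re-crawl policy: re-visit sooner when a page changes,
     *  back off when it does not. */
    public class AdaptiveInterval {
      static final float INC_RATE = 0.4f;           // growth when unchanged
      static final float DEC_RATE = 0.2f;           // shrink when changed
      static final long MIN = 60L * 60L;            // one hour, in seconds
      static final long MAX = 365L * 24L * 3600L;   // one year, in seconds

      static long nextInterval(long current, boolean pageChanged) {
        long next = pageChanged
            ? (long) (current * (1.0f - DEC_RATE))
            : (long) (current * (1.0f + INC_RATE));
        return Math.max(MIN, Math.min(MAX, next));
      }
    }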

---

So, what do we need to do next?

* we need to decide where we should commit our resources, as a community
of users, contributors and committers, so that the project is most
useful to our target audience. At this point there are few active
committers, so I don't think we can cover more than one direction at a
time ... ;)

* we need to re-architect Nutch to focus on our core competence, and
delegate what we can to other projects.

Feel free to comment on the above, make suggestions or corrections. I'd
like to wrap it up in a concise mission statement that would help us set
the goals for the next couple of months.

--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com