Re: The Future of Nutch, reactivated

Raymond Balmès Fri, 15 May 2009 08:50:54 -0700

I'm still a new user, so although I found it rather easy to get going and
build my own plugins, I have some suggestions.


Yes, one thing I'd like to see is a way to estimate how long a given step
(fetch, ...) will take, something like a progress bar. You launch a step and
it can run for days, working perfectly, yet you have no idea when it might
eventually end.

I find the web front end rather difficult to change, and I lost a lot of time
understanding how the NutchBean works.
Coming from Lucene, it took me a while to find out all the limitations it
has. I haven't played much with the Nutch-Solr integration, but from the sound
of it, it looks more powerful; whether it is simpler is my concern.

-Raymond-

2009/5/14 Mattmann, Chris A <chris.a.mattm...@jpl.nasa.gov>

> Hi Andrzej,
>
> Great summary. My general feeling on this is similar to my prior comments on
> similar threads from Otis and from Dennis. My personal pet projects for
> Nutch2:
>
> * refactored Nutch core data structures, modeled as POJOs
> * refactored Nutch architecture where crawling/indexing/parsing/scoring/etc.
> are insulated from the underlying messaging substrate (e.g., crawl over JMS,
> EJB, Hadoop, RMI, etc., crawl using Heritrix, parse using Tika or some other
> framework, etc.)
> * simpler Nutch deployment mechanisms (separate Nutch deployment package
> from source code package), think about using Maven2
>
> +1 to all of those and other ideas for how to improve the project's focus.
>
> Cheers,
> Chris
>
>
> On 5/14/09 6:45 AM, "Andrzej Bialecki" <a...@getopt.org> wrote:
>
> > Hi all,
> >
> > I'd like to revive this thread and gather additional feedback so that we
> > end up with concrete conclusions. Much of what I write below others have
> > said before, I'm trying here to express this as it looks from my point
> > of view.
> >
> > Target audience
> > ===============
> > I think that the Nutch project is experiencing an identity crisis now -
> > we are not sure what the target audience is, and we cannot satisfy
> > everyone. I think that there are the following groups of Nutch users:
> >
> > 1. Large-scale Internet crawl & search: actually, there are only a few
> > such users, because it takes considerable resources to manage operations
> > on that scale. Scalability, manage-ability and ranking/spam prevention
> > are the chief concerns here.
> >
> > 2. Medium-scale vertical search: I suspect that many Nutch users fall
> > into this category. Modularity, flexibility in implementing custom
> > processing, ability to modify workflows and to use only some Nutch
> > components seem to be chief concerns here. Scalability too, but only up
> > to a volume of ~100-200 mln documents.
> >
> > 3. Small- to medium-scale enterprise search: there's a sizeable number
> > of Nutch users that fall into this category, for historical reasons.
> > Link-based ranking and resource discovery are not that important here,
> > but integration with Windows networking, Microsoft formats and databases,
> > as well as realtime indexing and easy index maintenance are crucial.
> > This class of users often has to heavily customize Nutch to get any
> > sensible result. Also, this is where Solr really shines, so there is
> > little benefit in using Nutch here. I predict that Nutch will have fewer
> > and fewer users of this type.
> >
> > 4. Single desktop to small intranet search: as above, but the accent is
> > on the ease of use out of the box, and an often requested feature is a
> > GUI frontend. Currently IMHO Nutch is too complex and requires too much
> > command-line operation for casual users to make this use case attractive.
> >
> > What is the target audience that we as a community want to support? By
> > this I mean not only the moral support, but also active participation in
> > the development process. From the place where we are at the moment we
> > could go in any of the above directions.
> >
> > Core competence
> > ===============
> > This is a simple but important point. Currently we maintain several
> > major subsystems in Nutch that are implemented by other projects, and
> > often in a better way. Plugin framework (and dependency injection) and
> > content parsing are two areas that we have to delegate to third-party
> > libraries, such as Tika and OSGI or some other simple IOC container -
> > probably there are other components that we don't have to do ourselves.
> > Another thing that I'd love to delegate is the distributed search and
> > index maintenance - either through Solr or Katta or something else.
> >
> > The question then is, what is the core competence of this project? I see
> > the following major areas that are unique to Nutch:
> >
> > * crawling - this includes crawl scheduling (and re-crawl scheduling),
> > discovery and classification of new resources, strategies for crawling
> > specific sets of URLs (hosts and domains) under bandwidth and netiquette
> > constraints, etc.
> >
> > * web graph analysis - this includes link-based ranking, mirror
> > detection (and URL "aliasing") but also link spam detection and a more
> > complex control over the crawling frontier.
> >
> > Anything more? I'm not sure - perhaps I would add template detection and
> > pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).
> >
> > Nutch 1.0 already made some steps in this direction, with the new link
> > analysis package and pluggable FetchSchedule and Signature. A lot
> > remains to be done here, and we are still spending a lot of resources on
> > dealing with issues outside this core competence.
> >
> > -------
> >
> > So, what do we need to do next?
> >
> > * we need to decide where we should commit our resources, as a community
> > of users, contributors and committers, so that the project is most
> > useful to our target audience. At this point there are few active
> > committers, so I don't think we can cover more than 1 direction at a
> > time ... ;)
> >
> > * we need to re-architect Nutch to focus on our core competence, and
> > delegate what we can to other projects.
> >
> > Feel free to comment on the above, make suggestions or corrections. I'd
> > like to wrap it up in a concise mission statement that would help us set
> > the goals for the next couple months.
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >   ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
> >
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW:   http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
