Re: The Future of Nutch

Doğacan Güney Fri, 20 Mar 2009 02:56:27 -0700

Hi,

On Sat, Mar 14, 2009 at 02:19, Dennis Kubes <ku...@apache.org> wrote:
> With the release of Nutch 1.0 I think it is a good time to begin a
> discussion about the future of Nutch.  Here are some things to consider and
> would love to hear everyone's views on this.
>
> Nutch's original intention was as a large-scale www search engine.  That is
> a very specific goal.  Only a few people and organizations actually use it
> on that level.  (I just happen to be one of them as most of my work focuses
> on large scale web search as opposed to vertical search). Many, perhaps
> most, people using Nutch these days are either using parts of Nutch, such as
> the crawler, or are targeting towards vertical or intranet type search
> engines.  This can be seen in how many people have already started using the
> Solr integration features.  So while Nutch was originally intended as a www
> search, IMO most people aren't using it for that purpose.
>
> Since there are different purposes for different users, would it be good to
> consider moving Nutch to a top level apache project out from under the
> Lucene umbrella?  This would then allow the creation of nutch sub-projects,
> such as nutch-solr, nutch-hbase.  Thoughts?
>
> Many parts of Nutch have also been implemented in other projects.  For
> example, Tika for the parsers, Droids for the Crawler.  It begs the question
> what are Nutch's core features going forward.  When I think about search
> (again my perspective is large scale), I think crawling or acquisition of
> data, parsing, analysis, indexing, deployment, and searching.  I personally
> think that there is much room for improvement in crawling and especially
> analysis.  Nutch shouldn't just be about the shell but also the brains.
>


I think nutch-solr and nutch-hbase should be in one unified project :)

I can understand the difficulty (for newcomers) if we start depending
on too many external projects. It would certainly be confusing
to have to start a solr server then hbase master/slaves just to be
able to crawl one intranet website locally. On the other hand,
if we split nutch into nutch-hbase, nutch-hadoop and nutch-otherthings,
I am worried we will have to create a waaay too generic interface
to deal with them and not reap the advantages of using solr over
lucene and hbase over hadoop. Also, more backends possibly
mean more bugs and more integration problems.

So I think delegating nutch functionality to other projects
(tika/droids/solr/etc.) is a great idea (so nutch can focus on "the brains"
as Dennis said), but I don't like the idea of separating nutch into pieces.

So I guess for a small vertical search engine, it may seem unnecessary
to also deal with solr/etc., but as long as we have good documentation*,
they are not that difficult to handle. And they don't have a large
performance or memory overhead.

About the vertical/large-scale search engine split: I guess a good example here
is Dennis' FieldIndexer work. It is much more flexible for people who want
to extend nutch's indexing architecture, but maybe overkill (and I am not
convinced that it is) for people wanting to run vintage nutch on a small scale.
I, again, don't like splitting nutch into two (or three, four...) parts
like this. But I think having different crawl paths for different users is much
more manageable than having different architectures. So we always use
solr/hbase/etc. as our architecture. But you can run a one-job indexer if you
want, or run FieldIndexer. You can use the on-the-fly scoring scheme, or you can
use PageRank or other complex offline scoring schemes.

> And one of the biggest things I see is many newcomers to nutch have a very
> hard time getting started.  Part of this is understanding mapreduce
> mentality, part is documentation, part is there is only so much time some of
> us have to answer questions so some questions go unanswered on the lists.
>  How might this be improved going forward?
>

Docs, docs, docs :D

> Any other thoughts also welcome.  Really I want to start a discussion about
> where everyone thinks we are with the state of Nutch and its future.
>

Thanks for starting the discussion Dennis.

> Dennis
>
>

* And we don't have good documentation right now (and I am much
to blame for it :). I think this should be an explicit goal for us in the
future. I am thinking of something like "no major features without
documentation in the wiki".



-- 
Doğacan Güney
