Hi,

On Sat, Mar 14, 2009 at 02:19, Dennis Kubes <ku...@apache.org> wrote:
> With the release of Nutch 1.0 I think it is a good time to begin a
> discussion about the future of Nutch. Here are some things to consider and
> would love to hear everyone's views on this
>
> Nutch's original intention was as a large-scale www search engine. That is
> a very specific goal. Only a few people and organizations actually use it
> on that level. (I just happen to be one of them as most of my work focuses
> on large scale web search as opposed to vertical search). Many, perhaps
> most, people using Nutch these days are either using parts of Nutch, such as
> the crawler, or are targeting towards vertical or intranet type search
> engines. This can be seen in how many people have already started using the
> Solr integration features. So while Nutch was originally intended as a www
> search, IMO most people aren't using it for that purpose.
>
> Since there are different purposes for different users, would it be good to
> consider moving Nutch to a top level apache project out from under the
> Lucene umbrella? This would then allow the creation of nutch sub-projects,
> such as nutch-solr, nutch-hbase. Thoughts?
>
> Many parts of Nutch have also been implemented in other projects. For
> example, Tika for the parsers, Droids for the Crawler. It begs the question
> what are Nutch's core features going forward. When I think about search
> (again my perspective is large scale), I think crawling or acquisition of
> data, parsing, analysis, indexing, deployment, and searching. I personally
> think that there is much room for improvement in crawling and especially
> analysis. Nutch shouldn't just be about the shell but also the brains.
>

I think nutch-solr and nutch-hbase should be in one unified project :)

I can understand the difficulty (for newcomers) if we start depending on
too many external projects. It would certainly be confusing to have to
start a solr server and then hbase master/slaves just to be able to crawl
one intranet website locally.

On the other hand, if we split nutch into nutch-hbase, nutch-hadoop and
nutch-otherthings, I am worried we will have to create a waaay too generic
interface to deal with them and not reap the advantages of using solr over
lucene and hbase over hadoop. Also, more backends possibly mean more bugs
and more integration problems.

So I think delegating nutch functionality to other projects
(tika/droids/solr/etc) is a great idea (so nutch can focus on "the brains"
as Dennis said), but I don't like the idea of separating nutch into pieces.

So I guess for a small vertical search engine, it may seem unnecessary to
also deal with solr/etc, but as long as we have good documentation*, they
are not that difficult to handle. And they don't have a large performance
or memory overhead.

About the vertical/large-scale search engine split: I guess a good example
here is Dennis' FieldIndexer work. It is much more flexible for people who
want to extend nutch's indexing architecture, but maybe overkill (and I am
not convinced that it is) for people wanting to run vintage nutch on a
small scale. I, again, don't like splitting nutch into two (or three,
four...) parts like this. But I think having different crawl paths for
different users is much more manageable than having different
architectures. So we always use solr/hbase/etc. as our architecture. But
you can run a one-job indexer if you want, or run FieldIndexer. You can use
the on-the-fly scoring scheme, or you can use page rank or other complex
offline scoring schemes.

> And one of the biggest things I see is many newcomers to nutch have a very
> hard time getting started.
> Part of this is understanding mapreduce
> mentality, part is documentation, part is there is only so much time some of
> us have to answer questions so some questions go unanswered on the lists.
> How might this be improved going forward?
>

Docs, docs, docs :D

> Any other thoughts also welcome. Really I want to start a discussion about
> where everyone thinks we are with the state of Nutch and its future.
>

Thanks for starting the discussion Dennis.

> Dennis
>
>

* And we don't have good documentation right now (and I am much to blame
for it:). I think this should be an explicit goal for us in the future. I
am thinking something like "no major features without documentation in the
wiki".

--
Doğacan Güney