I actually use Nutch as a large-scale search engine on two products. A few things that would be nice to have are built-in options to produce an incremental index, and maybe a Quartz scheduler to automate it completely.
One thing that would be nice is that when one of us figures something out, like doing an incremental index, we create a document and post it to the wiki. Documentation has been one of the big hurdles for me.

Thanks for all your hard work, and I hope to contribute to the project soon.

Alex

--- On Fri, 3/13/09, Dennis Kubes <ku...@apache.org> wrote:

> From: Dennis Kubes <ku...@apache.org>
> Subject: The Future of Nutch
> To: nutch-user@lucene.apache.org
> Date: Friday, March 13, 2009, 7:19 PM
>
> With the release of Nutch 1.0 I think it is a good time to begin a discussion about the future of Nutch. Here are some things to consider, and I would love to hear everyone's views on this.
>
> Nutch's original intention was as a large-scale www search engine. That is a very specific goal. Only a few people and organizations actually use it on that level. (I just happen to be one of them, as most of my work focuses on large-scale web search as opposed to vertical search.) Many, perhaps most, people using Nutch these days are either using parts of Nutch, such as the crawler, or are targeting vertical or intranet-type search engines. This can be seen in how many people have already started using the Solr integration features. So while Nutch was originally intended as a www search, IMO most people aren't using it for that purpose.
>
> Since there are different purposes for different users, would it be good to consider moving Nutch to a top-level Apache project, out from under the Lucene umbrella? This would then allow the creation of Nutch sub-projects, such as nutch-solr and nutch-hbase. Thoughts?
>
> Many parts of Nutch have also been implemented in other projects. For example, Tika for the parsers, Droids for the crawler. It begs the question of what Nutch's core features are going forward. When I think about search (again, my perspective is large scale), I think crawling or acquisition of data, parsing, analysis, indexing, deployment, and searching. I personally think that there is much room for improvement in crawling and especially analysis. Nutch shouldn't just be about the shell but also the brains.
>
> And one of the biggest things I see is that many newcomers to Nutch have a very hard time getting started. Part of this is understanding the MapReduce mentality, part is documentation, and part is that some of us have only so much time to answer questions, so some questions go unanswered on the lists. How might this be improved going forward?
>
> Any other thoughts are also welcome. Really, I want to start a discussion about where everyone thinks we are with the state of Nutch and its future.
>
> Dennis