I'm still a new user, so although I found it rather easy to get going and build my own plugins, I have some suggestions.
One thing I'd like to see is a way to estimate how long a certain step (fetch, ...) will take, something like a progress bar. You launch a step and it can go on for days, working perfectly, but you still have no idea when it might eventually end.

I find the web front end rather difficult to change, and I lost a lot of time with the NutchBean trying to understand how it works. Coming from Lucene, it took me a while to find out all the limitations it has.

I haven't played much with the Nutch-Solr integration, but from the sound of it, it looks more powerful. Simplicity is my concern.

-Raymond-

2009/5/14 Mattmann, Chris A <chris.a.mattm...@jpl.nasa.gov>

> Hi Andrzej,
>
> Great summary. My general feeling on this is similar to my prior comments
> on similar threads from Otis and from Dennis. My personal pet projects for
> Nutch2:
>
> * refactored Nutch core data structures, modeled as POJOs
> * refactored Nutch architecture where crawling/indexing/parsing/scoring/etc.
> are insulated from the underlying messaging substrate (e.g., crawl over
> JMS, EJB, Hadoop, RMI, etc., crawl using Heritrix, parse using Tika or some
> other framework, etc.)
> * simpler Nutch deployment mechanisms (separate Nutch deployment package
> from source code package), think about using Maven2
>
> +1 to all of those and other ideas for how to improve the project's focus.
>
> Cheers,
> Chris
>
>
> On 5/14/09 6:45 AM, "Andrzej Bialecki" <a...@getopt.org> wrote:
>
> > Hi all,
> >
> > I'd like to revive this thread and gather additional feedback so that we
> > end up with concrete conclusions. Much of what I write below others have
> > said before, I'm trying here to express this as it looks from my point
> > of view.
> >
> > Target audience
> > ===============
> > I think that the Nutch project experiences a crisis of personality now -
> > we are not sure what is the target audience, and we cannot satisfy
> > everyone.
> > I think that there are following groups of Nutch users:
> >
> > 1. Large-scale Internet crawl & search: actually, there are only few
> > such users, because it takes considerable resources to manage operations
> > on that scale. Scalability, manage-ability and ranking/spam prevention
> > are the chief concerns here.
> >
> > 2. Medium-scale vertical search: I suspect that many Nutch users fall
> > into this category. Modularity, flexibility in implementing custom
> > processing, ability to modify workflows and to use only some Nutch
> > components seem to be chief concerns here. Scalability too, but only up
> > to a volume of ~100-200 mln documents.
> >
> > 3. Small- to medium-scale enterprise search: there's a sizeable number
> > of Nutch users that fall into this category, for historical reasons.
> > Link-based ranking and resource discovery are not that important here,
> > but integration with Windows networking, Microsoft formats and databases,
> > as well as realtime indexing and easy index maintenance are crucial.
> > This class of users often has to heavily customize Nutch to get any
> > sensible result. Also, this is where Solr really shines, so there is
> > little benefit in using Nutch here. I predict that Nutch will have fewer
> > and fewer users of this type.
> >
> > 4. Single desktop to small intranet search: as above, but the accent is
> > on the ease of use out of the box, and an often requested feature is a
> > GUI frontend. Currently IMHO Nutch is too complex and requires too much
> > command-line operation for casual users to make this use case attractive.
> >
> > What is the target audience that we as a community want to support? By
> > this I mean not only the moral support, but also active participation in
> > the development process. From the place where we are at the moment we
> > could go in any of the above directions.
> >
> > Core competence
> > ===============
> > This is a simple but important point.
> > Currently we maintain several
> > major subsystems in Nutch that are implemented by other projects, and
> > often in a better way. Plugin framework (and dependency injection) and
> > content parsing are two areas that we have to delegate to third-party
> > libraries, such as Tika and OSGI or some other simple IOC container -
> > probably there are other components that we don't have to do ourselves.
> > Another thing that I'd love to delegate is the distributed search and
> > index maintenance - either through Solr or Katta or something else.
> >
> > The question then is, what is the core competence of this project? I see
> > the following major areas that are unique to Nutch:
> >
> > * crawling - this includes crawl scheduling (and re-crawl scheduling),
> > discovery and classification of new resources, strategies for crawling
> > specific sets of URLs (hosts and domains) under bandwidth and netiquette
> > constraints, etc.
> >
> > * web graph analysis - this includes link-based ranking, mirror
> > detection (and URL "aliasing") but also link spam detection and a more
> > complex control over the crawling frontier.
> >
> > Anything more? I'm not sure - perhaps I would add template detection and
> > pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).
> >
> > Nutch 1.0 already made some steps in this direction, with the new link
> > analysis package and pluggable FetchSchedule and Signature. A lot
> > remains to be done here, and we are still spending a lot of resources on
> > dealing with issues outside this core competence.
> >
> > -------
> >
> > So, what do we need to do next?
> >
> > * we need to decide where we should commit our resources, as a community
> > of users, contributors and committers, so that the project is most
> > useful to our target audience. At this point there are few active
> > committers, so I don't think we can cover more than 1 direction at a
> > time ...
> > ;)
> >
> > * we need to re-architect Nutch to focus on our core competence, and
> > delegate what we can to other projects.
> >
> > Feel free to comment on the above, make suggestions or corrections. I'd
> > like to wrap it up in a concise mission statement that would help us set
> > the goals for the next couple months.
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> > ___. ___ ___ ___ _ _ __________________________________
> > [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> > ___|||__|| \| || | Embedded Unix, System Integration
> > http://www.sigram.com Contact: info at sigram dot com
> >
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.mattm...@jpl.nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++