Hi Frank,

Thanks for your interest in using Nutch!
The best way to see what's on the horizon, and what's needed in Nutch, is to check out our JIRA issue tracking system at:

http://issues.apache.org/jira/browse/NUTCH

At present, there are 39 current "issues" with Nutch planned for the upcoming 1.0.0 release: bugs to be fixed, new features to be added, and improvements to existing features. There are 222 open issues across all versions of Nutch (including prior releases).

To help you digest the wealth of information that's there (and trust me, there's plenty), I would offer a few of my own suggestions for class projects:

(Difficulty: High)
1. Decouple Nutch's crawl infrastructure, and turn it into its own extension point. The current Nutch crawl infrastructure is highly coupled around a few monolithic classes: Fetcher (or its big brother, Fetcher2), Hadoop (as the underlying job/crawl execution platform), etc. There have been several requests on the list to make the crawler its own component, make it lightweight, make it configurable, etc. I think an ambitious two-week student project would be to take a stab at this decoupling.

(Difficulty: Medium)
2. Analyze the Nutch code base, and propose/suggest architectural improvements. Currently, the Nutch code base is a behemoth of plugins/extension points, configuration properties, and the like. It would be nice to have a fresh look at its architecture from an outsider's perspective. The students would suggest places to cut, places to add, cleaner interfaces, and the appropriate underlying middleware substrates. For example, is Hadoop the only logical choice? What about other enterprise solutions such as web services/EJB/JMS/etc.?

(Difficulty: Medium)
3. Use Spring as the underlying configuration framework for Nutch, and overhaul Nutch's home-grown configuration infrastructure. Spring is an open source framework centered around providing configuration and instantiation middleware capabilities: it lets developers focus on the domain objects, and handles the rest.
The student would first take a look at Spring, then Nutch, then build a prototype that shows how Spring could be used to configure Nutch.

There are plenty of others; these are just a few ideas off the top of my head, but they should help get the juices flowing.

Also, FYI, a course on search engines has been taught for a few semesters at the University of Southern California (USC) by Dr. Ellis Horowitz. Here is a pointer to the course page, where you can find some other Nutch project suggestions:

http://www-scf.usc.edu/~csci572/

Good luck!

Cheers,
Chris

On 1/2/08 2:44 PM, "Frank McCown" <[EMAIL PROTECTED]> wrote:

> Greetings. I'm teaching a class on search engine development this
> semester, and I am considering having my students use Nutch in their
> projects (I'm new to Nutch myself). I'd like them to get some
> experience with an open source project and make a significant
> contribution. Are there any implementation tasks you guys think would
> be appropriate for a small group of undergrad, upperclass CS students?
> I'm looking for ideas for improving Nutch that they could accomplish
> in a few weeks time.
>
> Thanks,

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory
Pasadena, CA
Office: 171-266B
Mailstop: 171-246
_______________________________________________________

Disclaimer: The opinions presented within are my own and do not reflect those of either NASA, JPL, or the California Institute of Technology.