Hi Frank,

Thanks for your interest in using Nutch!

The best way to see what's on the horizon, and needed in Nutch, is to check
out our JIRA issue tracking system, at:

http://issues.apache.org/jira/browse/NUTCH

At present, 39 issues (bug fixes, new features, and improvements to existing
features) are slated for the upcoming 1.0.0 release. There are 222 open
issues across all versions of Nutch (including prior releases).

To help you digest the wealth of information that's there (and trust me,
there's plenty), I would offer a few of my own suggestions for class
projects:

(Difficulty: High) 1. Decouple Nutch's crawl infrastructure, and turn it
into its own extension point. The current Nutch crawl infrastructure is
tightly coupled to a few monolithic pieces: Fetcher (or its big brother,
Fetcher2), Hadoop (as the underlying job/crawl execution platform), etc.
There have been several requests on the list to make the crawler its own
component, make it light-weight, make it configurable, etc. I think an
ambitious two-week student project would be to take a stab at this
decoupling.

(Difficulty: Medium) 2. Analyze the Nutch code base, and propose/suggest
architectural improvements. Currently, the Nutch code base is a behemoth of
plugins/extension points, configuration properties, and the like. It would
be nice to have a fresh look at its architecture, from an outsider's
perspective. The students would suggest places to cut/places to add, cleaner
interfaces, the appropriate underlying middleware substrates, e.g., is
Hadoop the only logical choice? What about other enterprise solutions such
as web services/EJB/JMS/etc.?

(Difficulty: Medium) 3. Use Spring as the underlying configuration framework
for Nutch, and overhaul Nutch's home-grown configuration infrastructure.
Spring is an open source framework centered around providing configuration
and instantiation middleware capabilities: it lets developers focus on the
domain objects, and handles the rest. The student would first take a look at
Spring, then Nutch, then build a prototype that shows how Spring could be
used to configure Nutch.
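
To make the first idea a little more concrete, here is a minimal Java sketch
of what a crawler extension point might look like. Every name in it
(CrawlBackend, InMemoryBackend, CrawlBackendRegistry) is hypothetical and not
part of Nutch's API; it only illustrates the shape of the decoupling, and a
real version would presumably plug into Nutch's existing plugin system rather
than a hand-rolled registry.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical extension point: none of these names exist in Nutch.
interface CrawlBackend {
    /** Fetches one URL and returns its raw content as text. */
    String fetch(String url);
}

// Stand-in backend that fabricates content without touching the network.
class InMemoryBackend implements CrawlBackend {
    @Override
    public String fetch(String url) {
        return "stub content for " + url;
    }
}

// Minimal registry: the crawl driver looks a backend up by name instead of
// hard-coding a particular fetcher class or execution platform.
class CrawlBackendRegistry {
    private final Map<String, CrawlBackend> backends =
            new LinkedHashMap<String, CrawlBackend>();

    void register(String name, CrawlBackend backend) {
        backends.put(name, backend);
    }

    CrawlBackend lookup(String name) {
        CrawlBackend backend = backends.get(name);
        if (backend == null) {
            throw new IllegalArgumentException("No crawl backend named " + name);
        }
        return backend;
    }
}

public class CrawlSketch {
    public static void main(String[] args) {
        CrawlBackendRegistry registry = new CrawlBackendRegistry();
        registry.register("in-memory", new InMemoryBackend());

        // The driver depends only on the interface, not on the implementation.
        CrawlBackend backend = registry.lookup("in-memory");
        System.out.println(backend.fetch("http://example.org/"));
    }
}
```

The point of the sketch is that the driver code never names a concrete
fetcher, so a light-weight backend and a Hadoop-based one could be swapped
through configuration alone.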

There are plenty of others, but these few ideas off the top of my head
should help get the juices flowing.

Also, FYI, a course has been taught for a few semesters at the University of
Southern California (USC) by Dr. Ellis Horowitz on Search Engines. Here is a
pointer to that page. You can find some other Nutch project suggestions
there.

http://www-scf.usc.edu/~csci572/

Good luck!

Cheers,
 Chris


On 1/2/08 2:44 PM, "Frank McCown" <[EMAIL PROTECTED]> wrote:

> Greetings.  I'm teaching a class on search engine development this
> semester, and I am considering having my students use Nutch in their
> projects (I'm new to Nutch myself).  I'd like them to get some
> experience with an open source project and make a significant
> contribution.  Are there any implementation tasks you guys think would
> be appropriate for a small group of undergrad, upperclass CS students?
>  I'm looking for ideas for improving Nutch that they could accomplish
> in a few weeks time.
> 
> Thanks,

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.

