Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

The comment on the change is:
Added section for learning the nutch and hadoop source codebases

------------------------------------------------------------------------------
  
  Most questions on the lists are answered within a day.  If you ask a question 
and it is not answered for a couple of days, do not repost the same question.  
Instead you may need to reword your question, provide more information, or give 
a better description in the subject.
  
- more to come later ...
+ ==== Step Two: Learning the Nutch Source Code ====
+ I have found that when teaching new developers the basics of the Nutch source 
code, it is easiest to start by learning the operation of a full crawl from 
start to finish.
  
+ A word about Hadoop.  As soon as you start looking into Nutch code (versions 
0.8 or higher) you will be looking at code that uses and extends Hadoop APIs.  
Learning the Hadoop source code base is as big an endeavour as learning the 
Nutch codebase, but because of how much Nutch relies on Hadoop, anyone serious 
about Nutch development will also need to learn the Hadoop codebase.
+ 
+ First, get Nutch up and running and complete a full crawl, from fetching 
through indexing.  There are tutorials on this wiki that show how to do this.  
Second, get Nutch set up to run in an integrated development environment such 
as Eclipse.  There are also tutorials that show how to accomplish this.  Once 
this is done you should be able to run individual Nutch components inside a 
debugger.  This is essential because probably the fastest way to learn the 
Nutch codebase is to step through the different components in a debugger.
+ 
+ Start by looking at the crawl package, specifically the Injector, Generator, 
and CrawlDb classes.  Start with Injector: read over its source code and the 
classes that it references, then run the class as a main program inside of a 
debugger.  You will want to do this in local mode on the local file system.  
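+ As a sketch of the debug entry point, something like the following works as 
an Eclipse run configuration target.  This assumes the 0.8-era Injector, whose 
usage is `Injector <crawldb> <url_dir>`, with the Nutch jars and conf/ 
directory on the classpath; the paths are just examples:

```java
// Minimal debug harness: set a breakpoint inside Injector, then launch
// this class (or an equivalent Eclipse run configuration) in local mode.
public class DebugInjector {
  public static void main(String[] args) throws Exception {
    org.apache.nutch.crawl.Injector.main(new String[] {
      "crawl/crawldb",  // CrawlDb to create or update
      "urls"            // directory containing seed-URL text files
    });
  }
}
```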
+ 
+ You will also want to take a look at the junit test classes for the same 
package.  These can be found under the src/test folder.  There should be test 
classes for all of the main components.  Again read the source code for the 
test classes and execute the junit test cases to get a deeper understanding of 
how each component works.
+ 
+ Follow this pattern of reading the source code and the classes it references, 
running through the source in a debugger if possible, and reading and executing 
the junit test cases for each component in the crawl process.  In order they 
are Injector, Generator, Fetcher, ParseSegment, CrawlDb, LinkDb, Indexer, 
DeleteDuplicates.  Then you will want to look at other components including 
SegmentMerger, CrawlDbMerger, and DistributedSearch.  It is usually best to get 
through one or two components before beginning to look at the Hadoop code, and 
then switch back and forth between the Nutch and Hadoop code to understand 
where and how Nutch uses Hadoop.
+ 
+ In Hadoop it is better to take the packages one at a time, for example mapred 
or dfs, than to take the strategy of running components, as Hadoop is server 
based.  You can still follow the pattern of reviewing junit tests to get an 
understanding of the Hadoop source code.  Once you feel you have a grasp of 
various parts of the source code in Nutch or Hadoop, I would recommend creating 
small junit test cases that use your newfound knowledge.  For example, you can 
create a small test case that fetches a few URLs and verifies that they were 
fetched correctly.  If you get through all of this then you will have a good 
foundation of knowledge in the Nutch and Hadoop source code bases, and you 
should be fine starting to develop software for both Nutch and Hadoop.
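+ A real test of this kind would drive Nutch's Fetcher against a test segment.  
As a plain-JDK stand-in for the fetch-then-verify pattern, a sketch might look 
like this (the class and method names here are hypothetical, not Nutch APIs):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.URL;

// Hypothetical stand-in for a fetch-and-verify junit case: "fetch" a
// file: URL and check that the content round-trips intact.  A real Nutch
// test would use the Fetcher and a test segment instead of java.net.URL.
public class FetchCheck {

  // Read everything behind a URL into a String (file: URLs work offline).
  static String fetchContent(String url) throws IOException {
    BufferedReader in = new BufferedReader(
        new InputStreamReader(new URL(url).openStream()));
    try {
      StringBuilder sb = new StringBuilder();
      int c;
      while ((c = in.read()) != -1) {
        sb.append((char) c);
      }
      return sb.toString();
    } finally {
      in.close();
    }
  }

  public static void main(String[] args) throws IOException {
    // Write a known page to disk, then "fetch" it back and verify.
    File seed = File.createTempFile("seed", ".txt");
    seed.deleteOnExit();
    PrintWriter out = new PrintWriter(seed);
    out.print("hello nutch");
    out.close();

    String fetched = fetchContent(seed.toURI().toURL().toString());
    if (!"hello nutch".equals(fetched)) {
      throw new AssertionError("fetched content did not match");
    }
    System.out.println("fetched ok");
  }
}
```

The point is the shape, not the transport: set up known input, run the 
component, and assert on what came back, exactly as the existing Nutch test 
classes do.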
+ 
