Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

The comment on the change is:
Added section for learning the nutch and hadoop source codebases

------------------------------------------------------------------------------
  Most questions on the lists are answered within a day. If you ask a question and it is not answered for a couple of days, do not repost the same question. Instead you may need to reword your question, provide more information, or give a better description in the subject.
  
- more to come later ...
+ ==== Step Two: Learning the Nutch Source Code ====
+ I have found that when teaching new developers the basics of the Nutch source code, it is easiest to start with the operations of a full crawl from start to finish.
+ A word about Hadoop. As soon as you start looking into Nutch code (versions 0.8 or higher) you will be looking at code that uses and extends Hadoop APIs. Learning the Hadoop codebase is as big an endeavour as learning the Nutch codebase, but because of how much Nutch relies on Hadoop, anyone serious about Nutch development will also need to learn the Hadoop codebase.
+ 
+ First, get Nutch up and running and complete a full process of fetching through indexing. There are tutorials on this wiki that show how to do this. Second, get Nutch set up to run in an integrated development environment such as Eclipse. There are also tutorials that show how to accomplish this. Once this is done you should be able to run individual Nutch components inside of a debugger. This is essential because probably the fastest way to learn the Nutch codebase is to step through different components in a debugger.
+ 
+ Start by looking at the crawl package, specifically the Injector, Generator, and CrawlDb classes. Start with Injector: read over the source code for it and the classes that it references. Run the class as a main program inside of a debugger.
You will want to do this in local mode on the local file system.
+ 
+ You will also want to take a look at the JUnit test classes for the same package. These can be found under the src/test folder. There should be test classes for all of the main components. Again, read the source code for the test classes and execute the JUnit test cases to get a deeper understanding of how each component works.
+ 
+ Follow this pattern for each component in the crawl process: read the source code and the classes it references, run through the source in a debugger if possible, and read and execute the JUnit test cases. In order they are Injector, Generator, Fetcher, ParseSegment, CrawlDb, LinkDb, Indexer, DeleteDuplicates. Then you will want to look at other components including SegmentMerger, CrawlDbMerger, and DistributedSearch. It is usually best to get through one or two components before beginning to look at the Hadoop code, and then switch back and forth between the Nutch and Hadoop code to understand where and how Nutch uses Hadoop.
+ 
+ In Hadoop it is better to take the packages one at a time, for example mapred or dfs, than to take the strategy of running components, as Hadoop is server based. You can still follow the pattern of reviewing JUnit tests to get an understanding of the Hadoop source code. Once you feel you have a grasp of various parts of the source code in Nutch or Hadoop, I would recommend creating small JUnit test cases that use your newfound knowledge. For example, you can create a small test case that fetches a few URLs and verifies that they were fetched correctly. If you get through all of this then you will have a good foundation of knowledge in the Nutch and Hadoop source codebases, and you should be fine in starting to develop software for both Nutch and Hadoop.
+ 
_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs