pike wrote:
> Hi
>
> I'm new to nutch.
> Can anyone point me to some documentation about
> the directory structure Nutch creates and maintains
> when crawling, indexing etc ? We're doing "whole-web"
> crawls step by step. Since I have no reference, it's
> hard to see whether crawling, merging, indexing, etc
> went ok.
>
>
> thanks!
> *-pike
>

Well, unfortunately there is not much documentation out there, but you should start by reading the articles on the Nutch wiki. For the index structure, look to the Lucene wiki, since Nutch uses Lucene for its inverted index. To inspect the generated indexes you can use Luke or the command-line tool lucli, which can be found in the contrib directory of Lucene.

Nutch stores the crawl state of the URLs in the crawldb. The crawldb is an instance of Hadoop's MapFile, which is a sequence of <key,value> pairs sorted by key. The keys in the crawldb are URLs and the values are CrawlDatum objects. A MapFile uses two SequenceFiles, one for storing the data and the other for indexing it. You should check the javadocs of these classes for further info. The linkdb is also stored as map files, mapping URLs to Inlink objects.

For a deeper understanding of the system, browse the javadocs and skim through the code.

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
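
To make the MapFile layout described above a bit more concrete, here is a rough in-memory model of the idea: one sorted "data" part holding every <key,value> pair, and a smaller "index" part that remembers only some of the keys and where they sit in the data. This is not Hadoop's code or API. The class name ToyMapFile, the index_interval parameter, the example URLs, and the plain status strings standing in for CrawlDatum objects are all made up for illustration.

```python
import bisect


class ToyMapFile:
    """Toy in-memory model of a MapFile: sorted data plus a sparse index."""

    def __init__(self, entries, index_interval=2):
        # "data" part: every <key, value> pair, sorted by key.
        self.data = sorted(entries.items())
        # "index" part: every index_interval-th key and its position in data.
        self.index = [(key, pos) for pos, (key, _) in enumerate(self.data)
                      if pos % index_interval == 0]
        self.index_keys = [key for key, _ in self.index]

    def get(self, key):
        # Find the last indexed key that is <= the wanted key.
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None  # the key sorts before everything stored
        # Scan the data forward from the indexed position.
        start = self.index[slot][1]
        for k, v in self.data[start:]:
            if k == key:
                return v
            if k > key:
                break  # passed the place where the key would have been
        return None


# Hypothetical crawldb contents: URL -> status string (stand-in for CrawlDatum).
crawldb = ToyMapFile({
    "http://example.com/": "fetched",
    "http://example.com/about": "unfetched",
    "http://example.org/": "fetched",
    "http://example.net/": "gone",
})
```

A lookup such as crawldb.get("http://example.org/") first searches the small index and then scans only a short stretch of the data, which is why the keys have to be kept in sorted order.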
