All, I am a master's student and want to crawl the whole web for my master's project.
While trying to generate, fetch, and crawl the whole web using Nutch (I am following the steps from http://lucene.apache.org/nutch/tutorial8.html), I got confused by various Nutch terms and their usage:

1) What is the purpose of *crawl_fetch* and *crawldb*, and what is the difference between them? If Nutch stores all the information about URLs in *crawldb*, then why is *crawl_fetch* needed?

2) Moreover, what do the fetch and generate commands do? Can anyone describe them in detail? Is there any documentation for Nutch commands like generate, fetch, etc.?

Thanks & Regards,
Gaurang Patel