Hi,

On Tue, Aug 18, 2009 at 10:10, Ankit Dangi <[email protected]> wrote:
> Hello All,
>
> After performing a crawl using Nutch, I wanted to read the content of all
> the crawled URLs. I ran the following command: "$NUTCH_HOME/bin/nutch
> readseg -dump $segment myseg", where $segment holds the name of the
> segment directory and 'myseg' is the name of the directory where the dump
> of the segment is created.
>
> I understand that a Nutch segment has 6 sub-directories: crawl_generate,
> crawl_fetch, crawl_parse, parse_data, parse_text and content. One of the
> records obtained from the dump has been put at this URL for your
> reference: http://dangiankit.googlepages.com/rec-13.txt. Can anyone please
> look into the file and let me know why we have 8 (eight) CrawlDatum
> sections? I believe there should have been only 3 such sections, one each
> for crawl_generate, crawl_fetch and crawl_parse. For other records, the
> count varies. Also, any other information regarding the CrawlDatum
> sections would be appreciated.

Most of those come from crawl_parse. In crawl_parse there is a CrawlDatum
for every page that linked to your URL. Indeed, if you dumped only
crawl_parse* you would get most of those CrawlDatums.

* bin/nutch readseg -dump ... ... -nofetch -nogenerate -noparsedata
  -noparsetext -nocontent

> P.S.
> Cross-posted on nutch-dev and nutch-user.
> The record file has been hosted on my googlepages merely for reference; no
> intention of spamming, please.

Please do not crosspost your questions.
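As a concrete sketch of the crawl_parse-only dump mentioned above (the
segment path crawl/segments/20090818101010 and the output directory
parse_only_dump are just placeholders, not names from this thread;
substitute your own):

  # placeholder segment path and output directory; use your own
  bin/nutch readseg -dump crawl/segments/20090818101010 parse_only_dump \
      -nofetch -nogenerate -noparsedata -noparsetext -nocontent

The dump written to parse_only_dump should then contain only the
crawl_parse CrawlDatum entries, one per page linking to the URL, which is
where the extra sections in rec-13.txt come from.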
--
Doğacan Güney