Hello all,

After performing a crawl using Nutch, I wanted to read the content of all the crawled URLs, so I ran the following command:

    $NUTCH_HOME/bin/nutch readseg -dump $segment myseg

where $segment holds the path of the segment directory, and 'myseg' is the name of the directory in which the dump of the segment is created.
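As a side note, my understanding is that readseg also accepts switches to leave individual sub-directories out of the dump; something like the following (assuming these options exist in this Nutch version):

    # Dump everything except the raw page content:
    $NUTCH_HOME/bin/nutch readseg -dump $segment myseg -nocontent

with -nofetch, -nogenerate, -noparse, -noparsedata and -noparsetext presumably working along the same lines.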
I understand that a Nutch segment has six sub-directories: crawl_generate, crawl_fetch, crawl_parse, parse_data, parse_text and content. One of the records obtained from the dump file has been put at this URL for your reference: http://dangiankit.googlepages.com/rec-13.txt

Can anyone please look into the file and let me know why there are eight CrawlDatum sections? I believe there should have been only three, one each for crawl_generate, crawl_fetch and crawl_parse. For other records, the count varies. Also, any other information regarding the CrawlDatum sections would be appreciated.
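In case it helps, one way to narrow it down might be to use the switches mentioned above to dump one sub-directory at a time and count the section headers. This assumes the concatenated output ends up in a file named 'dump' inside the output directory:

    # Dump only crawl_parse, then count its CrawlDatum sections
    # ('CrawlDatum::' appears to be the section label in the dump):
    $NUTCH_HOME/bin/nutch readseg -dump $segment myseg_parse \
        -nocontent -nofetch -nogenerate -noparsedata -noparsetext
    grep -c 'CrawlDatum::' myseg_parse/dump

If crawl_parse alone accounts for the extra sections, that would at least tell us which sub-directory they come from.

P.S. Cross-posted on nutch-dev and nutch-user. The record file has been hosted on my googlepages merely for reference, no intentions of spamming please.

-- Ankit Dangi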