Hi,

2009/8/18 Doğacan Güney <[email protected]>

> Hi,
>
> On Tue, Aug 18, 2009 at 10:10, Ankit Dangi <[email protected]> wrote:
>
> > Hello All,
> >
> > After performing a crawl using Nutch, I wanted to read the content of all
> > the crawled URLs. I performed the following command:
> > "$NUTCH_HOME/bin/nutch readseg -dump $segment myseg"; where, $segment
> > contains the name of the segment file, and 'myseg' is the name of the
> > directory where the dump of the segment is created.
> >
> > I understand that Nutch segment has 6 sub-directories.. crawl_generate,
> > crawl_fetch, crawl_parse, parse_data, parse_text and content. One of the
> > records obtained from the dump file has been kept at this URL:
> > http://dangiankit.googlepages.com/rec-13.txt for your reference. Can
> > anyone please look into the file and let me know as to why do we have
> > 8 (eight) CrawlDatum sections.. I believe, there should have been only 3
> > such sections, one each for crawl_generate, crawl_fetch and crawl_parse.
> > For other records, the count varies. Also, any other information
> > regarding the CrawlDatum sections would be appreciated.
>
> Most of those come from crawl_parse. In crawl_parse, there is a CrawlDatum
> for every page that linked to your URL. Indeed, if you tried only dumping*
> crawl_parse you would get most of those CrawlDatum-s.

That makes sense. However, the URL in that record is one of the seed URLs I
supplied for the crawl. I also checked the linkdb and found no inlinks listed
for it. Rather, it appears as an inlink to various other URLs, which is
expected. The command I executed to get the linkdb was readlinkdb with the
-dump option.

> * bin/nutch readseg -dump ... ... -nofetch -nogenerate -noparsedata
> -noparsetext -nocontent
>
> > P.S.
> > Cross posted on nutch-dev and nutch-user.
> > The record file has been hosted on my googlepages merely for reference,
> > no intentions of spamming please.
>
> Please do not crosspost your questions.

Okay.

> > --
> > Ankit Dangi
>
> --
> Doğacan Güney

--
Ankit Dangi
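
For anyone wanting to check the per-record CrawlDatum counts discussed above, the following is a minimal sketch rather than a definitive tool. It assumes the text dump written by readseg marks each record with a line starting with "URL::" and each datum with a line starting with "CrawlDatum::"; the function name count_crawl_datums is hypothetical and not part of Nutch.

```python
from collections import Counter


def count_crawl_datums(dump_text):
    """Count CrawlDatum sections per record URL in a readseg text dump.

    Assumption: each record has a line starting with "URL::" and every
    datum begins with a line starting with "CrawlDatum::".
    """
    counts = Counter()
    current_url = None
    for raw_line in dump_text.splitlines():
        line = raw_line.strip()
        if line.startswith("URL::"):
            current_url = line[len("URL::"):].strip()
            counts[current_url] += 0  # register the record even if it has no datum
        elif line.startswith("CrawlDatum::") and current_url is not None:
            counts[current_url] += 1
    return dict(counts)


if __name__ == "__main__":
    import sys

    # Usage (path is whatever file readseg wrote into the dump directory):
    #   python count_datums.py <path-to-dump-file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for url, n in sorted(count_crawl_datums(handle.read()).items()):
            print(n, url)
```

Running it once on the full dump and once on a dump restricted to crawl_parse (using the -no* options quoted above) should show how many of the sections for a given URL originate from crawl_parse.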
