Hi,

On Tue, Aug 18, 2009 at 10:10, Ankit Dangi <[email protected]> wrote:
> Hello All,
>
> After performing a crawl using Nutch, I wanted to read the content of all
> the crawled URLs. I ran the following command: "$NUTCH_HOME/bin/nutch
> readseg -dump $segment myseg", where $segment holds the name of the
> segment directory and 'myseg' is the name of the directory where the dump
> of the segment is created.
>
> I understand that a Nutch segment has 6 sub-directories: crawl_generate,
> crawl_fetch, crawl_parse, parse_data, parse_text and content. One of the
> records obtained from the dump has been put at this URL for your
> reference: http://dangiankit.googlepages.com/rec-13.txt. Can anyone please
> look into the file and let me know why we have 8 (eight) CrawlDatum
> sections? I believe there should have been only 3 such sections, one each
> for crawl_generate, crawl_fetch and crawl_parse. For other records, the
> count varies. Also, any other information regarding the CrawlDatum
> sections would be appreciated.

Most of those come from crawl_parse. In crawl_parse there is a CrawlDatum
for every page that linked to your URL. Indeed, if you dumped only
crawl_parse* you would get most of those CrawlDatums.

* bin/nutch readseg -dump ... ... -nofetch -nogenerate -noparsedata
  -noparsetext -nocontent

> P.S.
> Cross-posted on nutch-dev and nutch-user.
> The record file has been hosted on my googlepages merely for reference; no
> intention of spamming, please.

Please do not crosspost your questions.
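As a concrete sketch of the crawl_parse-only dump mentioned above (the
segment path crawl/segments/20090818101010 and the output directory
parse_only_dump are just placeholders, not names from this thread;
substitute your own):

  # placeholder segment path and output directory; use your own
  bin/nutch readseg -dump crawl/segments/20090818101010 parse_only_dump \
      -nofetch -nogenerate -noparsedata -noparsetext -nocontent

The dump written to parse_only_dump should then contain only the
crawl_parse CrawlDatum entries, one per page linking to the URL, which is
where the extra sections in rec-13.txt come from.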
--
Doğacan Güney