Hi,

2009/8/18 Doğacan Güney <[email protected]>

> Hi,
>
> On Tue, Aug 18, 2009 at 10:10, Ankit Dangi <[email protected]> wrote:
>
> > Hello All,
> >
> > After performing a crawl using Nutch, I wanted to read the content of all
> > the crawled URLs. I performed the following command:
> "$NUTCH_HOME/bin/nutch
> > readseg -dump $segment myseg"; where, $segment contains the name of the
> > segment file, and 'myseg' is the name of the directory where the dump of
> > the
> > segment is created.
> >
> > I understand that a Nutch segment has 6 sub-directories: crawl_generate,
> > crawl_fetch, crawl_parse, parse_data, parse_text and content. One of the
> > records obtained from the dump file has been kept at this URL:
> > http://dangiankit.googlepages.com/rec-13.txt for your reference. Can
> > anyone please look into the file and let me know why we have 8 (eight)
> > CrawlDatum sections? I believe there should have been only 3 such
> > sections, one each for crawl_generate, crawl_fetch and crawl_parse. For
> > other records, the count varies. Also, any other information regarding
> > the CrawlDatum sections would be appreciated.
> >
>
> Most of those come from crawl_parse. In crawl_parse, there is a CrawlDatum
> for every page that linked to your URL. Indeed, if you tried only dumping*
> crawl_parse you would get most of those CrawlDatum-s.
>

That makes sense. But the URL (which forms the specified record in the
file) is one of the seed URLs I gave for the crawl. Also, I checked the
linkdb and found none of those URLs listed as its inlinks; rather, it is
one of the inlinks to various other URLs, which is quite obvious. The
command I had executed to dump the linkdb was readlinkdb with the -dump
option.
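For reference, the commands were along these lines (a sketch only: the
paths crawl/linkdb and linkdb_dump, and the grep URL, are illustrative
placeholders, not my actual values):

```shell
# Dump the linkdb to a plain-text directory for inspection.
# (crawl/linkdb and linkdb_dump are illustrative paths.)
$NUTCH_HOME/bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump

# Search the dump for a given seed URL to see its recorded inlinks.
# (http://example.com/ stands in for the actual seed URL.)
grep -A 2 'http://example.com/' linkdb_dump/part-00000
```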



>
> * bin/nutch readseg -dump ... ... -nofetch -nogenerate -noparsedata
> -noparsetext -nocontent
>
>
> >
> > P.S.
> > Cross posted on nutch-dev and nutch-user.
> > The record file has been hosted on my googlepages merely for reference;
> > no intention of spamming, please.
> >
>
> Please do not crosspost your questions.
>

Okay.


>
>
> >
> > --
> > Ankit Dangi
> >
>
>
>
> --
> Doğacan Güney
>



-- 
Ankit Dangi
