Hi Chris,
Thanks for the quick reply.
I already went through that page, but it only gives technical information
about the directories. It does not explain how these folders relate to one
another or what they really mean in terms of the crawled output.
For example, what does crawl_parse contain? Is it all the crawled data
parsed in terms of HTML tags, or does it just contain the URLs extracted
from the pages?

On Thu, Oct 1, 2015 at 11:32 PM, Mattmann, Chris A (3980) <
[email protected]> wrote:

> Please see:
>
> http://wiki.apache.org/nutch/NutchFileFormats
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
> -----Original Message-----
> From: sanjay singh <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, October 1, 2015 at 11:22 PM
> To: "[email protected]" <[email protected]>
> Subject: Apache Nutch Output structure
>
> >Hi,
> >I am trying to crawl a certain set of websites using Apache Nutch. I
> >configured Nutch with the required parameters. After crawling I got various
> >segments as output, which I merged into one segment.
> >But I am still unable to make sense of the file structure in the
> >output and the meaning associated with it.
> >The merged segment contains the following directories:
> >content
> >crawl_fetch
> >crawl_generate
> >crawl_parse
> >parse_data
> >parse_text
> >
> >Can someone please explain the significance of these directories, or point
> >me to some documentation that explains them in detail?
> >
> >
> >--
> >Regards,
> >Sanjay Singh, PICT Pune
>
>


-- 
Regards,
Sanjay Singh, PICT Pune
