Hello, you should try the newer dump tool; it writes the HTML files as-is to a
directory.
Markus
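
Once the pages have been written to a directory, parsing them is ordinary HTML
processing. Below is a minimal Python sketch using only the standard library.
It assumes the dumped files are UTF-8 and end in ".html"; the extension and the
names ElementCollector, extract_elements and scan_dump_dir are illustrative,
not something Nutch prescribes. It collects only the page title and link
targets, as an example of "getting the elements".

```python
from html.parser import HTMLParser
from pathlib import Path


class ElementCollector(HTMLParser):
    """Collects the page title and all link targets from one HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs; keep only non-empty hrefs
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_elements(html_text):
    """Return (title, links) for a single HTML string."""
    parser = ElementCollector()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.links


def scan_dump_dir(dump_dir):
    """Parse every *.html file under dump_dir (assumed layout) into a dict."""
    results = {}
    for path in sorted(Path(dump_dir).rglob("*.html")):
        # errors="replace" avoids crashing on pages that are not valid UTF-8
        text = path.read_text(encoding="utf-8", errors="replace")
        results[str(path)] = extract_elements(text)
    return results
```

Calling scan_dump_dir on the output directory returns a mapping from file path
to (title, links). If the dumped files carry a different extension, adjust the
glob pattern accordingly.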

 
 
-----Original message-----
> From:Vijay Veluchamy <vijay.veluch...@gmail.com>
> Sent: Tuesday 5th April 2016 17:24
> To: user@nutch.apache.org
> Subject: RE: How to read segment dump?
> 
> Hi,
> 
> I am looking to crawl a website and save its pages as HTML files. After
> that, I need to parse them and extract the elements in them.
> 
> Thanks,
> Vijay
> On Apr 5, 2016 8:37 PM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:
> 
> > Hello, segment dumps are notoriously hard to comprehend. What information
> > are you looking for? What do you mean by reading contents of a website?
> > Markus
> >
> >
> >
> > -----Original message-----
> > > From:Vijay Veluchamy <vijay.veluch...@gmail.com>
> > > Sent: Tuesday 5th April 2016 16:22
> > > To: user@nutch.apache.org
> > > Subject: How to read segment dump?
> > >
> > > Hi Team,
> > >
> > > I need to crawl a website using Apache Nutch. Currently, I am using Nutch
> > > 1.x.
> > >
> > > > I have followed the steps provided at the following URL up to the
> > > > 'invertlinks' step.
> > >
> > > https://wiki.apache.org/nutch/NutchTutorial
> > >
> > > > Then I used the 'readseg' command to dump the segments. The dump file
> > > > was created successfully.
> > >
> > > Now, I have the following questions.
> > >
> > > > 1. Is this the right file (the segment dump file) for reading the
> > > > contents of a website? If yes, how do I read the contents from the dump
> > > > file? I am unable to read it, as it looks encrypted.
> > > > 2. Otherwise, how can I read the contents of a website?
> > >
> > > Thanks,
> > > Vijay
> > >
> >
> 