Hi,

I want to crawl a website and save its pages as HTML files. After that, I
need to parse those files and extract elements from them.
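
To illustrate the parsing step I have in mind, here is a rough sketch. It
uses only Python's standard html.parser module and is not tied to Nutch in
any way. The names ElementCollector and extract_elements are hypothetical,
made up for this example, and the sketch assumes each crawled page is already
available as an HTML string.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects the text content of every element with a given tag name.

    Hypothetical helper for illustration only. Intended for elements that
    have a closing tag (e.g. <a>, <p>, <title>), not void tags like <br>.
    """

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0       # greater than 0 while inside a matching element
        self.results = []    # text of each outermost matching element
        self._buffer = []    # text pieces of the element being read

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self._buffer = []

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


def extract_elements(html, tag):
    """Return the text of every `tag` element found in an HTML string."""
    parser = ElementCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.results
```

For a page saved to disk, the file's text would be read first and then passed
in, for example extract_elements(page_text, "a") to list the link texts.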

Thanks,
Vijay
On Apr 5, 2016 8:37 PM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:

> Hello, segment dumps are notoriously hard to comprehend. What information
> are you looking for? What do you mean by reading contents of a website?
> Markus
>
>
>
> -----Original message-----
> > From: Vijay Veluchamy <vijay.veluch...@gmail.com>
> > Sent: Tuesday 5th April 2016 16:22
> > To: user@nutch.apache.org
> > Subject: How to read segment dump?
> >
> > Hi Team,
> >
> > I need to crawl a website using Apache Nutch. Currently, I am using Nutch
> > 1.x.
> >
> > I have followed the steps in the following tutorial up to the 'invertlink'
> > step.
> >
> > https://wiki.apache.org/nutch/NutchTutorial
> >
> > Then I used the 'readseg' command to dump the segments. The dump file was
> > created successfully.
> >
> > Now, I have the following questions.
> >
> > 1. Is the segment dump file the right file for reading the contents of a
> > website? If yes, how do I read the contents from the dump file? I am unable
> > to read it, as it looks encrypted.
> > 2. If not, how can I read the contents of a website?
> >
> > Thanks,
> > Vijay
> >
>
