Hi,

I want to crawl a website and save its pages as HTML files. After that, I need to parse those files and extract the elements in them.
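To make the parsing step concrete, here is a minimal sketch of the kind of element extraction described above. It assumes the crawled pages are already available as plain HTML strings or files, and it uses only Python's standard-library `html.parser`; it is not part of Nutch. The names `ElementCollector` and `extract_elements` are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ElementCollector(HTMLParser):
    """Collects tag name, attributes and direct text of every element."""

    def __init__(self):
        super().__init__()
        self.elements = []  # all elements seen, in document order
        self._open = []     # stack of elements still waiting for an end tag

    def handle_starttag(self, tag, attrs):
        element = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._open.append(element)

    def handle_endtag(self, tag):
        # Close the most recent matching open element (tolerates sloppy HTML).
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["tag"] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Attach non-blank text to the innermost open element.
        if self._open and data.strip():
            self._open[-1]["text"] += data.strip()


def extract_elements(html):
    """Return a list of dicts describing the elements found in `html`."""
    parser = ElementCollector()
    parser.feed(html)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    sample = '<html><body><h1>Title</h1><a href="/about">About us</a></body></html>'
    for element in extract_elements(sample):
        print(element["tag"], element["attrs"], element["text"])
```

For a saved page, the same function could be called on the file contents, for example `extract_elements(open(path, encoding="utf-8").read())`, where `path` is whatever location the HTML was written to.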

Thanks,
Vijay

On Apr 5, 2016 8:37 PM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:

> Hello, segment dumps are notoriously hard to comprehend. What information
> are you looking for? What do you mean by reading contents of a website?
> Markus
>
> -----Original message-----
> > From: Vijay Veluchamy <vijay.veluch...@gmail.com>
> > Sent: Tuesday 5th April 2016 16:22
> > To: user@nutch.apache.org
> > Subject: How to read segment dump?
> >
> > Hi Team,
> >
> > I need to crawl a website using Apache Nutch. Currently, I am using Nutch
> > 1.x.
> >
> > I have followed the steps provided in the following URL up to the
> > 'invertlinks' step.
> >
> > https://wiki.apache.org/nutch/NutchTutorial
> >
> > Then I used the 'readseg' command to dump the segments. The dump file was
> > created successfully.
> >
> > Now, I have the following questions.
> >
> > 1. Is this the right file (segment dump file) to read the contents of a
> > website? If yes, how do I read the contents from the dump file? I am
> > unable to read it, as it looks encrypted.
> > 2. Otherwise, how can I read the contents of a website?
> >
> > Thanks,
> > Vijay