Hello,

You should try the newer dump tool; it writes the fetched HTML files as-is to an output directory.

Markus
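For the "parse them and get the elements" step, here is a minimal sketch in Python, not a definitive recipe. It assumes the dump was produced with something like `bin/nutch dump -segment <segment_dir> -outputDir <output_dir>` (check the tool's usage output for the exact options in your Nutch version) and that the dumped pages end in `.html` or `.htm`. The names `ElementCollector` and `parse_dump_dir` are made up for illustration, and the title and link targets are just sample elements to extract.

```python
from html.parser import HTMLParser
from pathlib import Path


class ElementCollector(HTMLParser):
    """Collects the page title and all link targets from one HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            # attrs is a list of (name, value) pairs; value may be None
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def parse_dump_dir(dump_dir, suffixes=(".html", ".htm")):
    """Return {file path: (title, [links])} for every HTML file under dump_dir.

    Assumption: dumped pages carry one of the given suffixes; adjust
    `suffixes` if your dump names files differently.
    """
    results = {}
    for path in sorted(Path(dump_dir).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        # Crawled pages are not always valid UTF-8, so replace bad bytes
        text = path.read_text(encoding="utf-8", errors="replace")
        collector = ElementCollector()
        collector.feed(text)
        collector.close()
        results[str(path)] = (collector.title.strip(), collector.links)
    return results
```

Calling `parse_dump_dir("dump_out")`, where `dump_out` stands for whatever directory you passed as the output directory, returns one entry per dumped page. The standard-library parser is used here only to keep the sketch dependency-free; any HTML parser you prefer would work the same way on the dumped files.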
-----Original message-----
> From: Vijay Veluchamy <vijay.veluch...@gmail.com>
> Sent: Tuesday 5th April 2016 17:24
> To: user@nutch.apache.org
> Subject: RE: How to read segment dump?
>
> Hi,
>
> I am looking to crawl a website as HTML files. After that, I need to
> parse them and get the elements in them.
>
> Thanks,
> Vijay
> On Apr 5, 2016 8:37 PM, "Markus Jelsma" <markus.jel...@openindex.io> wrote:
>
> > Hello, segment dumps are notoriously hard to comprehend. What information
> > are you looking for? What do you mean by reading contents of a website?
> > Markus
> >
> > -----Original message-----
> > > From: Vijay Veluchamy <vijay.veluch...@gmail.com>
> > > Sent: Tuesday 5th April 2016 16:22
> > > To: user@nutch.apache.org
> > > Subject: How to read segment dump?
> > >
> > > Hi Team,
> > >
> > > I need to crawl a website using Apache Nutch. Currently, I am using Nutch
> > > 1.x.
> > >
> > > I have followed the steps provided in the following URL up to the
> > > 'invertlinks' step.
> > >
> > > https://wiki.apache.org/nutch/NutchTutorial
> > >
> > > Then, I used the 'readseg' command to dump the segments. The dump file is
> > > created successfully.
> > >
> > > Now, I have the following questions.
> > >
> > > 1. Is this the right file (segment dump file) to read the contents of a
> > > website? If yes, how do I read the contents from the dump file? I am
> > > unable to read it, as it looks like it is encrypted.
> > > 2. Otherwise, how can I read the contents of a website?
> > >
> > > Thanks,
> > > Vijay