I'm not sure what you want to do with the text, but I get access to it in
plug-ins via the parse.getText() method.

That is assuming you are interested in accessing it from within a plugin.

-Raymond-

2009/5/26 Julien Nioche <[email protected]>

> Hrishi,
>
> The best solution depends on what you want to do with the HTML data.
>
> Regarding your solution (1) - Nutch already stores it in the content
> subdirectory of the segments as Content objects in a MapFile. A better
> option would be to write a small map reduce program and specify as input
> the content subdir of the segment(s). The segments can be read with
> Hadoop's standard SequenceFileInputFormat, so your Map function will get
> Content objects. Being implemented as Map-Reduce, this will be distributed
> and gives you the possibility to implement any specific processing you
> want on the map (or reduce) side.
>
> If you are not too familiar with Hadoop and writing Map Reduce code, I
> recommend Tom White's excellent book (http://www.hadoopbook.com/).
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>
> 2009/5/26 Hrishikesh Agashe <[email protected]>
>
> > Hi,
> >
> > After doing a bit of research it seems that there are two ways to get
> > HTML data out from Nutch:
> > 1. Change Nutch code to dump HTML data as it crawls
> > 2. Use "readseg" command after crawling finishes and segments are
> > generated.
> >
> > Is this correct?
> >
> > If so, I would like to know which approach is better. Specifically, in
> > the case of Nutch on Hadoop, does "readseg" operate in a distributed way,
> > or does it just operate on a single machine? If readseg just works on one
> > machine, I don't think it's feasible if segment sizes are large. In that
> > case the first approach is better.
> >
> > Also, can anyone share their experiences for doing large crawls (1000s of
> > websites) and extracting out HTML data?
> >
> > Thanks,
> > --Hrishi
>
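
For reference, Julien's suggestion of a small map reduce program over the
content subdir of a segment might look roughly like the sketch below. It is
only an illustration, not code from this thread: the class names HtmlDumper
and HtmlMapper are made up, it assumes the Nutch 1.0 era "mapred" API with
the Nutch and Hadoop jars on the classpath, it assumes pages are UTF-8, and
it simply keeps records whose content type starts with text/html.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;

/** Hypothetical map-only job: emits (url, raw HTML) for one segment. */
public class HtmlDumper {

  /** Receives one (url, Content) record per fetched page. */
  public static class HtmlMapper extends MapReduceBase
      implements Mapper<Text, Content, Text, Text> {

    public void map(Text url, Content content,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      String type = content.getContentType();
      // Skip anything that is not HTML (images, PDFs, ...).
      if (type == null || !type.startsWith("text/html")) {
        return;
      }
      // Assumption: the page is UTF-8; real pages may need charset detection.
      String html = new String(content.getContent(), "UTF-8");
      output.collect(url, new Text(html));
    }
  }

  /** args[0] = segment directory, args[1] = output directory. */
  public static void main(String[] args) throws IOException {
    JobConf job = new NutchJob(NutchConfiguration.create());
    job.setJobName("dump-html");
    job.setJarByClass(HtmlDumper.class);

    // Read the "content" subdirectory of the segment.
    FileInputFormat.addInputPath(job, new Path(args[0], Content.DIR_NAME));
    job.setInputFormat(SequenceFileInputFormat.class);

    job.setMapperClass(HtmlMapper.class);
    job.setNumReduceTasks(0); // map-only: no reduce step is needed here

    job.setOutputFormat(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    JobClient.runJob(job);
  }
}
```

Because the HTML values span several lines, plain text output is awkward to
parse back; writing to a SequenceFile instead, or doing the actual
processing inside map(), is probably the more practical choice.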
