I'm not sure what you want to do with the text, but I get access to it in the plugins via the parse.getText() method.
That assumes you are interested in accessing it from within one of the plugins.

-Raymond-

2009/5/26 Julien Nioche <[email protected]>

> Hrishi,
>
> The best solution depends on what you want to do with the HTML data.
>
> Regarding your solution (1) - Nutch already stores it in the content
> subdirectory of the segments as Content objects in a MapFile. A better
> option would be to write a small map reduce program and specify as input
> the content subdir of the segment(s). Nutch uses the standard
> MapInputFileFormat from Nutch and so your Map function will get Content
> objects. Being implemented as Map-Reduce, this will be distributed and
> gives you the possibility to implement any specific processing you want
> on the map (or reduce) side.
>
> If you are not too familiar with Hadoop and writing Map Reduce code, I
> recommend Tom White's excellent book (http://www.hadoopbook.com/).
>
> Julien
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
>
> 2009/5/26 Hrishikesh Agashe <[email protected]>
>
> > Hi,
> >
> > After doing a bit of research it seems that there are two ways to get
> > HTML data out from Nutch:
> > 1. Change Nutch code to dump HTML data as it crawls
> > 2. Use "readseg" command after crawling finishes and segments are
> >    generated.
> >
> > Is this correct?
> >
> > If so, I would like to know which approach is better. Specifically, in
> > case of Nutch on Hadoop, does "readseg" operate in a distributed way or
> > does it just operate on a single machine? If readseg just works on one
> > machine, I don't think it's feasible if segment sizes are large. In
> > that case the first approach is better.
> >
> > Also, can anyone share their experiences of doing large crawls (1000s
> > of websites) and extracting the HTML data?
> >
> > Thanks,
> > --Hrishi
