Hi,

If you are trying to extract data from web pages, the place to start is the "parse-html" plugin. In the "src/plugin" directory you will also find parsers for other formats such as PDF and MS Excel. Parsing is the right hook because pages are parsed before they are indexed: if you parse them in your own way and extract the data you need, you can then index them using that extracted data.

Bye,
kranthi
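P.S. For illustration, here is a rough sketch of what a custom parse filter might look like. The class name MyExtractor, the metadata key "myfield", and the exact method signatures are only placeholders and assumptions on my part; the HtmlParseFilter interface has changed between Nutch versions, so check the interface in your own checkout before copying this.

    // Sketch of a custom HtmlParseFilter that stashes extracted data in the
    // parse metadata so it can later be copied into the index.
    // NOTE: imports and signatures are approximate (Nutch 0.9-era API) and
    // may differ in your version; "myfield" is just a placeholder key.
    package org.example.nutch;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class MyExtractor implements HtmlParseFilter {

      private Configuration conf;

      // Called for every HTML page after the standard HTML parser has run.
      public Parse filter(Content content, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        // Walk the DOM (doc) or inspect parse.getText() and pull out
        // whatever data you need, e.g. a product price or an author name.
        String extracted = parse.getText();   // placeholder "extraction"

        // Store it in the parse metadata; an IndexingFilter can later copy
        // it into a field of the document at indexing time.
        parse.getData().getParseMeta().set("myfield", extracted);
        return parse;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

To activate something like this you would also declare it in the plugin's plugin.xml and add the plugin id to the plugin.includes property in conf/nutch-site.xml; a matching IndexingFilter can then move the "myfield" metadata into an index field.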
On Tue, May 27, 2008 at 5:16 PM, Jorge Conejero Jarque <[EMAIL PROTECTED]> wrote:
> I would like to build an application using the Nutch API that could extract
> data from the pages before they are indexed and apply some kind of
> modification or processing to them, because that could become something
> useful and interesting.
>
> The problem is that I cannot find information on how to use the crawl part
> of the Nutch API.
>
> I only find tutorials that are run from the console with Cygwin and that
> only explain aspects of configuration and creating an index.
>
> It would help if you could give me some examples.
> Thanks.
>
> Best regards.
>
> Jorge Conejero Jarque
> Dept. Java Technology Group
> GPM Factoría Internet
> 923100300
> http://www.gpm.es