Hi,

If you are trying to extract data from web pages, the place to start is the "parse-html" plugin. In the "src/plugin" directory you will also find parsers for other formats such as PDF and MS Excel. Parsing is the right hook because pages are parsed before they are indexed: if you parse them in your own way and extract the data you need, you can then index them using that extracted data.

Bye,
kranthi
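P.S. For illustration, here is a rough sketch of what a custom parse filter might look like. The class name MyExtractor, the metadata key "myfield", and the exact method signatures are only placeholders and assumptions on my part; the HtmlParseFilter interface has changed between Nutch versions, so check the interface in your own checkout before copying this.

    // Sketch of a custom HtmlParseFilter that stashes extracted data in the
    // parse metadata so it can later be copied into the index.
    // NOTE: imports and signatures are approximate (Nutch 0.9-era API) and
    // may differ in your version; "myfield" is just a placeholder key.
    package org.example.nutch;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.HTMLMetaTags;
    import org.apache.nutch.parse.HtmlParseFilter;
    import org.apache.nutch.parse.Parse;
    import org.apache.nutch.protocol.Content;
    import org.w3c.dom.DocumentFragment;

    public class MyExtractor implements HtmlParseFilter {

      private Configuration conf;

      // Called for every HTML page after the standard HTML parser has run.
      public Parse filter(Content content, Parse parse,
                          HTMLMetaTags metaTags, DocumentFragment doc) {
        // Walk the DOM (doc) or inspect parse.getText() and pull out
        // whatever data you need, e.g. a product price or an author name.
        String extracted = parse.getText();   // placeholder "extraction"

        // Store it in the parse metadata; an IndexingFilter can later copy
        // it into a field of the document at indexing time.
        parse.getData().getParseMeta().set("myfield", extracted);
        return parse;
      }

      public void setConf(Configuration conf) { this.conf = conf; }
      public Configuration getConf() { return conf; }
    }

To activate something like this you would also declare it in the plugin's plugin.xml and add the plugin id to the plugin.includes property in conf/nutch-site.xml; a matching IndexingFilter can then move the "myfield" metadata into an index field.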
On Tue, May 27, 2008 at 5:16 PM, Jorge Conejero Jarque <[EMAIL PROTECTED]> wrote:
> I would like to build an application using the Nutch API that could extract
> data from the pages before they are indexed and apply some kind of
> modification or processing to them, because that could become something
> useful and interesting.
>
> The problem is that I cannot find information on how to use the crawl part
> of the Nutch API.
>
> I only find tutorials that are run from the console with Cygwin and that
> only explain aspects of configuration and creating an index.
>
> It would help if you could give me some examples.
> Thanks.
>
> Best regards.
>
> Jorge Conejero Jarque
> Dept. Java Technology Group
> GPM Factoría Internet
> 923100300
> http://www.gpm.es