Re: Apache Nutch : query

Lewis John Mcgibbney Thu, 07 Apr 2016 07:28:02 -0700

Hi pesmadhu,
My replies are inline below.

On Wed, Apr 6, 2016 at 7:14 AM, <user-digest-h...@nutch.apache.org> wrote:


>
> From: "pesmadhu ." <pesma...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Date: Wed, 6 Apr 2016 15:12:28 +0530
> Subject: Apache Nutch : query
> Hi,
>
>    We have a requirement to scrape pages that contain table data. We need
> to read the table content and, depending on the value of some column in
> each row, download the corresponding file.
>
> Example urls : http://exporter.nih.gov/ExPORTER_Catalog.aspx
>
> http://exporter.nih.gov/ExPORTER_Catalog.aspx?sid=3&index=0
>
> http://exporter.nih.gov/ExPORTER_Catalog.aspx?sid=0&index=1
>
>
> Please check and suggest whether we can achieve this using Apache Nutch.
>

You need to write a plugin, as described at
http://wiki.apache.org/nutch/PluginCentral
The extension point you need to implement is HtmlParseFilter:
http://nutch.apache.org/apidocs/apidocs-1.11/index.html?org/apache/nutch/parse/HtmlParseFilter.html
Its filter method gives you access to the parsed page as a DocumentFragment,
which you can traverse to locate your tables and columns.
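
As a rough sketch of the DOM traversal only (not a complete Nutch plugin), the
code below walks a DocumentFragment, finds table rows whose cell in a given
column matches a value, and collects the links in those rows. It uses only the
JDK's org.w3c.dom classes. The class name TableLinkExtractor, its method names,
and the sample table are hypothetical and are not taken from the ExPORTER
pages. In a real plugin you would call something like this from your
HtmlParseFilter implementation with the DocumentFragment Nutch passes in.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of Nutch, only illustrates the DOM walk.
class TableLinkExtractor {

    /** Returns href values from every row whose cell at columnIndex equals wanted. */
    static List<String> linksWhereColumnEquals(Node root, int columnIndex, String wanted) {
        List<String> links = new ArrayList<>();
        collectRows(root, columnIndex, wanted, links);
        return links;
    }

    private static void collectRows(Node node, int columnIndex, String wanted, List<String> out) {
        // Element names may be upper case depending on the HTML parser, so ignore case.
        if (node.getNodeType() == Node.ELEMENT_NODE && "tr".equalsIgnoreCase(node.getNodeName())) {
            List<Element> cells = new ArrayList<>();
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (c.getNodeType() == Node.ELEMENT_NODE && "td".equalsIgnoreCase(c.getNodeName())) {
                    cells.add((Element) c);
                }
            }
            if (cells.size() > columnIndex
                    && wanted.equals(cells.get(columnIndex).getTextContent().trim())) {
                for (Element cell : cells) {
                    collectHrefs(cell, out);
                }
            }
            return; // simplification: tables nested inside a row are not examined
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectRows(c, columnIndex, wanted, out);
        }
    }

    private static void collectHrefs(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                out.add(href);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectHrefs(c, out);
        }
    }

    /** Demo helper: builds a DocumentFragment from well-formed markup. */
    static DocumentFragment fragmentOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        DocumentFragment frag = doc.createDocumentFragment();
        frag.appendChild(doc.getDocumentElement());
        return frag;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample table; the column layout is an assumption for illustration.
        DocumentFragment frag = fragmentOf(
                "<table>"
                + "<tr><td>2015</td><td>CSV</td><td><a href=\"a.zip\">a</a></td></tr>"
                + "<tr><td>2016</td><td>XML</td><td><a href=\"b.zip\">b</a></td></tr>"
                + "</table>");
        System.out.println(linksWhereColumnEquals(frag, 1, "CSV")); // prints [a.zip]
    }
}
```

How the collected links then get fetched (for example, by adding them as
outlinks so the crawler picks them up in a later round) depends on your setup
and is not shown here.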


>
> I have one more query: what is the main use of Apache Nutch?
>
>
Generally speaking, web crawling, as described at http://nutch.apache.org
Thanks
