Hi pesmadhu,

Replies inline.

On Wed, Apr 6, 2016 at 7:14 AM, <user-digest-h...@nutch.apache.org> wrote:
>
> From: "pesmadhu ." <pesma...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Date: Wed, 6 Apr 2016 15:12:28 +0530
> Subject: Apache Nutch : query
>
> Hi,
>
> We have a requirement to scrape the urls data which contains table data,
> we need to read the table content and depending on some column value of
> table data we need to download the file.
>
> Example urls : http://exporter.nih.gov/ExPORTER_Catalog.aspx
>
> http://exporter.nih.gov/ExPORTER_Catalog.aspx?sid=3&index=0
>
> http://exporter.nih.gov/ExPORTER_Catalog.aspx?sid=0&index=1
>
> Please check and suggest can we achieve this using Apache Nutch.
>

You need to write a plugin as described at
http://wiki.apache.org/nutch/PluginCentral

The extension point you need to work with is
http://nutch.apache.org/apidocs/apidocs-1.11/index.html?org/apache/nutch/parse/HtmlParseFilter.html

This provides you access to the DocumentFragment and you can locate your
tables and columns.

>
> I have one more query, what is the main usage of Apache Nutch.
>

Generally speaking... Web crawling as described on http://nutch.apache.org

Thanks
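
P.S. To make the HtmlParseFilter suggestion a little more concrete, here is a rough sketch of the DOM walk such a filter could run on the DocumentFragment it receives. It is plain JDK code (org.w3c.dom only), not a Nutch plugin. The class name TableLinkExtractor, its methods, the column index, the "CSV" value and the sample table are all hypothetical and only for illustration; I have not checked them against the real ExPORTER page layout. Tag names are compared case-insensitively because HTML parsers may report element names in upper case.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: not part of Nutch, just the table-walking logic.
class TableLinkExtractor {

    /** Returns the href of every link found in rows whose cell at columnIndex equals wanted. */
    static List<String> extractLinks(Node root, int columnIndex, String wanted) {
        List<String> links = new ArrayList<>();
        visit(root, columnIndex, wanted, links);
        return links;
    }

    private static void visit(Node node, int columnIndex, String wanted, List<String> links) {
        if (isElement(node, "tr")) {
            // Gather the direct <td> children of this row (header <th> cells are skipped).
            List<Node> cells = new ArrayList<>();
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (isElement(c, "td")) {
                    cells.add(c);
                }
            }
            if (columnIndex >= 0 && columnIndex < cells.size()
                    && wanted.equals(cells.get(columnIndex).getTextContent().trim())) {
                collectHrefs(node, links);
            }
            return; // Simplification: tables nested inside a row are not examined separately.
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            visit(c, columnIndex, wanted, links);
        }
    }

    // Adds the href of every <a> element at or below the given node.
    private static void collectHrefs(Node node, List<String> links) {
        if (isElement(node, "a")) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                links.add(href);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectHrefs(c, links);
        }
    }

    private static boolean isElement(Node node, String name) {
        return node.getNodeType() == Node.ELEMENT_NODE
                && name.equalsIgnoreCase(node.getNodeName());
    }

    /** Parses well-formed markup for the demo; inside Nutch the DOM is handed to the filter instead. */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample table; the real page structure will differ.
        String page = "<table>"
                + "<tr><th>Year</th><th>Type</th><th>File</th></tr>"
                + "<tr><td>2015</td><td>CSV</td><td><a href=\"/files/a.zip\">a</a></td></tr>"
                + "<tr><td>2015</td><td>XML</td><td><a href=\"/files/b.zip\">b</a></td></tr>"
                + "</table>";
        System.out.println(extractLinks(parse(page), 1, "CSV")); // prints [/files/a.zip]
    }
}
```

Since DocumentFragment is itself an org.w3c.dom.Node, the fragment passed to your filter could be handed to extractLinks directly. What you then do with the collected links (for example, how they get queued for download) is up to your plugin and is not shown here.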