Hi Markus,

I couldn't understand how I can avoid switch cases with your suggested approach.

I would have one plugin that implements HtmlParseFilter, and I would have
to check the current URL by calling content.getUrl(). All of this happens
in the same class, so I would have to add switch cases. I could put the
XPath expressions for each site in separate files, but to get the right
expressions I would still have to decide which file to read, and that
decision logic would again end up in a switch statement.
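
For reference, one way to read the file-per-host suggestion without any
switch is to derive the file name from the URL's host, so that adding a site
means adding a file rather than a case. Below is a minimal sketch in plain
Java, not actual Nutch code: the class name HostRules, the
<host>.properties naming, and the field=XPath file layout are all
assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: maps a URL's host to a rules file, e.g.
// rules/example.com.properties containing lines such as
//   title=//h1/text()
final class HostRules {
    private final Path rulesDir;
    // Each host's file is read at most once and then served from memory.
    private final Map<String, Properties> cache = new ConcurrentHashMap<>();

    HostRules(Path rulesDir) {
        this.rulesDir = rulesDir;
    }

    // Returns field-name -> XPath pairs for the URL's host, or an empty
    // set if the host has no rules file. URI.create throws
    // IllegalArgumentException for malformed URLs.
    Properties rulesFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return new Properties();
        }
        return cache.computeIfAbsent(host.toLowerCase(), this::load);
    }

    // The file name is computed from the host, so no per-site branch is needed.
    private Properties load(String host) {
        Properties rules = new Properties();
        Path file = rulesDir.resolve(host + ".properties");
        if (Files.isRegularFile(file)) {
            try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                rules.load(reader);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return rules;
    }
}
```

In a parse filter, the string returned by content.getUrl() would be passed
to rulesFor(), and each returned XPath would then be evaluated against the
parsed document.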

Please correct me if I am getting this all wrong!

I think extracting custom data from pages is a common requirement for web
crawling solutions. Are there no such Nutch plugins already available on
the web?


On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma

> Hi,
> Yes, you should write a plugin that has a parse filter and an indexing
> filter. To ease maintenance you would want to have a file per host/domain
> containing XPath expressions, which is far easier than switch statements
> that need to be recompiled. The indexing filter would then index the
> field values extracted by your parse filter.
> Cheers,
> Markus
> -----Original message-----
> > From:Tony Mullins <tonymullins...@gmail.com>
> > Sent: Tue 11-Jun-2013 16:07
> > To: user@nutch.apache.org
> > Subject: Data Extraction from 100+ different sites...
> >
> > Hi,
> >
> > I have 100+ different sites (and more may be added in the near future).
> > I have to crawl them and extract my required information from each
> > site, so each site would have its own extraction rules (XPaths).
> >
> > So far I have seen there is no built-in mechanism in Nutch to fulfill my
> > requirement, and I may have to write a custom HtmlParseFilter extension
> > and an IndexingFilter plugin.
> >
> > And I may have to write 100+ switch cases in my plugin to handle the
> > extraction rules of each site....
> >
> > Is this the best way to handle my requirement, or is there a better way
> > to handle it?
> >
> > Thanks for your support & help.
> >
> > Tony.
> >
