Hi Markus,

I don't understand how I can avoid switch cases with the approach you suggested.
I would have one plugin that implements HtmlParseFilter, and I would have to check the current URL by calling content.getUrl(). All of this happens in the same class, so I would still have to add switch cases. I could put the XPath expressions for each site in separate files, but to get the right expressions I would have to decide which file to read, and that decision logic would again end up in a switch statement.

Please correct me if I am getting this all wrong!

Also, I think extracting custom data from pages is a common requirement for web crawling solutions. Aren't there any Nutch plugins for this available on the web?

Thanks,
Tony.

On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hi,
>
> Yes, you should write a plugin that has a parse filter and indexing
> filter. To ease maintenance you would want to have a file per host/domain
> containing XPath expressions, far easier than switch statements that need
> to be recompiled. The indexing filter would then index the field values
> extracted by your parse filter.
>
> Cheers,
> Markus
>
> -----Original message-----
> > From: Tony Mullins <tonymullins...@gmail.com>
> > Sent: Tue 11-Jun-2013 16:07
> > To: user@nutch.apache.org
> > Subject: Data Extraction from 100+ different sites...
> >
> > Hi,
> >
> > I have 100+ different sites (and maybe more will be added in the near
> > future). I have to crawl them and extract my required information from
> > each site, so each site would have its own extraction rules (XPaths).
> >
> > So far I have seen there is no built-in mechanism in Nutch to fulfill my
> > requirement, and I may have to write a custom HTMLParserFilter extension
> > and IndexFilter plugin.
> >
> > And I may have to write 100+ switch cases in my plugin to handle the
> > extraction rules of each site....
> >
> > Is this the best way to handle my requirement, or is there a better way
> > to handle it?
> >
> > Thanks for your support & help.
> >
> > Tony.
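
For illustration only, here is a rough sketch of how the file-per-host idea discussed in this thread could avoid a switch statement: the host name taken from the page URL is used directly as the name of the rule file, so the lookup itself selects the file. Everything here is an assumption made for the example and not part of Nutch: the class name HostXPathRules, the rules directory, and the "<host>.properties" layout with one "fieldName=xpath" entry per line. A parse filter could call rulesFor(content.getUrl()) and evaluate each returned expression.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical helper (not a Nutch class): maps a page URL to the XPath
 * rules stored in a per-host properties file, e.g. rules/example.com.properties.
 */
final class HostXPathRules {
    private final Path rulesDir;
    // Cache so each host's rule file is read at most once.
    private final Map<String, Map<String, String>> cache = new ConcurrentHashMap<>();

    HostXPathRules(Path rulesDir) {
        this.rulesDir = rulesDir;
    }

    /**
     * Returns field name -> XPath expression for the URL's host,
     * or an empty map when the host has no rule file.
     * Note: URI.create throws IllegalArgumentException on malformed URLs.
     */
    Map<String, String> rulesFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Collections.emptyMap();
        }
        // The host name selects the file; no per-site branching is needed.
        return cache.computeIfAbsent(host.toLowerCase(Locale.ROOT), this::load);
    }

    private Map<String, String> load(String host) {
        Path file = rulesDir.resolve(host + ".properties");
        if (!Files.isRegularFile(file)) {
            return Collections.emptyMap();
        }
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Copy into a sorted map so iteration order is deterministic.
        Map<String, String> rules = new TreeMap<>();
        for (String field : props.stringPropertyNames()) {
            rules.put(field, props.getProperty(field));
        }
        return Collections.unmodifiableMap(rules);
    }
}
```

With this layout, supporting a new site would mean dropping a new properties file into the rules directory rather than recompiling the plugin.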