You can use URLUtil in that parse filter to determine which host/domain you 
are on and lazily load the file of expressions for that host. Just keep a 
Map<hostname, List<expressions>> in your object and load each host's list of 
expressions on demand.
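A minimal, self-contained sketch of that lazy-loaded map (class name, file layout, and the stub loader are illustrative; a real Nutch parse filter would resolve the host with org.apache.nutch.util.URLUtil and read the expressions from per-host config files):

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class XPathRegistry {

    // Cache of host -> XPath expressions, filled lazily on first request.
    private final Map<String, List<String>> cache = new HashMap<>();

    // Stand-in loader: a real plugin would read a per-host file such as
    // conf/xpaths/<host>.txt (the path is illustrative, not a Nutch convention).
    private List<String> loadExpressionsFor(String host) {
        if (host.endsWith("example.com")) {
            return Arrays.asList("//h1[@id='title']", "//span[@class='price']");
        }
        return Collections.emptyList();
    }

    // Resolve the page's host and return its expressions, loading them at
    // most once per host via computeIfAbsent.
    public List<String> expressionsFor(String pageUrl) {
        String host;
        try {
            host = new URL(pageUrl).getHost();
        } catch (MalformedURLException e) {
            return Collections.emptyList();
        }
        return cache.computeIfAbsent(host, this::loadExpressionsFor);
    }

    public static void main(String[] args) {
        XPathRegistry registry = new XPathRegistry();
        // First call loads the list; later calls for the same host hit the cache.
        System.out.println(registry.expressionsFor("http://www.example.com/item/1"));
        System.out.println(registry.expressionsFor("http://other.org/page"));
    }
}
```

No switch statement is needed: adding a site means adding one map entry (or one expressions file), not recompiling the plugin.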
 
-----Original message-----
> From:Tony Mullins <tonymullins...@gmail.com>
> Sent: Tue 11-Jun-2013 18:59
> To: user@nutch.apache.org
> Subject: Re: Data Extraction from 100+ different sites...
> 
> Hi Markus,
> 
> I couldn't understand how I can avoid switch cases in your suggested
> idea....
> 
> I would have one plugin which would implement HtmlParseFilter, and I would
> have to check the current URL by calling content.getUrl(). This would all
> happen in the same class, so I would have to add switch cases... I could
> add the XPath expressions for each site in separate files, but to get the
> XPath expressions I would have to decide which file to read, and for that
> I would have to put this logic in a switch case....
> 
> Please correct me if I am getting this all wrong!!!
> 
> And I think this is a common requirement for web crawling solutions, to get
> custom data from a page... so aren't there any such Nutch plugins
> available on the web?
> 
> Thanks,
> Tony.
> 
> 
> On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma
> <markus.jel...@openindex.io>wrote:
> 
> > Hi,
> >
> > Yes, you should write a plugin that has a parse filter and indexing
> > filter. To ease maintenance you would want to have a file per host/domain
> > containing XPath expressions, far easier than switch statements that need
> > to be recompiled. The indexing filter would then index the field values
> > extracted by your parse filter.
> >
> > Cheers,
> > Markus
> >
> > -----Original message-----
> > > From:Tony Mullins <tonymullins...@gmail.com>
> > > Sent: Tue 11-Jun-2013 16:07
> > > To: user@nutch.apache.org
> > > Subject: Data Extraction from 100+ different sites...
> > >
> > > Hi,
> > >
> > > I have 100+ different sites (and more may be added in the near
> > > future). I have to crawl them and extract my required information from
> > > each site. So each site would have its own extraction rules (XPaths).
> > >
> > > So far I have seen there is no built-in mechanism in Nutch to fulfill my
> > > requirement, and I may have to write a custom HtmlParseFilter extension
> > > and an IndexingFilter plugin.
> > >
> > > And I may have to write 100+ switch cases in my plugin to handle the
> > > extraction rules of each site....
> > >
> > > Is this the best way to handle my requirement, or is there a better way
> > > to handle it?
> > >
> > > Thanks for your support & help.
> > >
> > > Tony.
> > >
> >
> 
