Re: Data Extraction from 100+ different sites...

AC Nutch Tue, 11 Jun 2013 10:10:39 -0700

Hi Tony,

So if I understand correctly, you have 100+ web pages, each with a totally
different format, and you're trying to extract separate, unrelated pieces of
information from each. If there's no connection between the pages or between
the pieces of information you want, then you can't avoid providing a separate
identifier and rule for each one. Markus' suggestion, I believe, is to have a
"dictionary" file with the URL as the key and the XPath expression for the
info that you want as the value. Whatever crawling/parsing platform you use,
you'll need something of that sort given those assumptions.
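
To make the dictionary idea concrete, here is a minimal sketch in plain Java,
not an actual Nutch plugin. The class name SiteRules, the example hosts and
the rules format (java.util.Properties syntax, keyed by host rather than full
URL, along the lines of Markus' file-per-host idea) are all my own
assumptions. It also assumes well-formed markup; inside a real parse filter
you would evaluate the XPath against the DOM that Nutch hands you instead.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Optional;
import java.util.Properties;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical helper: looks up the XPath rule for a page by its host. */
class SiteRules {
    private final Properties rules = new Properties();

    /** rulesText holds one "host = xpath" pair per line (Properties syntax). */
    SiteRules(String rulesText) {
        try {
            rules.load(new StringReader(rulesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the XPath configured for the URL's host, or empty if none. */
    Optional<String> xpathFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(rules.getProperty(host));
    }

    /** Applies the host's XPath to well-formed markup and returns the text. */
    Optional<String> extract(String url, String xhtml) {
        Optional<String> expr = xpathFor(url);
        if (!expr.isPresent()) {
            return Optional.empty(); // no rule for this site: nothing to extract
        }
        try {
            String value = XPathFactory.newInstance().newXPath()
                    .evaluate(expr.get(), new InputSource(new StringReader(xhtml)));
            return Optional.of(value.trim());
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or markup for " + url, e);
        }
    }
}
```

The point is that the lookup replaces the switch: adding a site means adding
one line to the rules file, with no recompilation.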


That being said, is there any common form that the data you're trying to
extract from these pages follows? Is there a regex that could match it or
anything else that might identify it in a common way?

Alex


On Tue, Jun 11, 2013 at 12:59 PM, Tony Mullins <tonymullins...@gmail.com> wrote:

> Hi Markus,
>
> I couldn't understand how I can avoid switch cases with your suggested
> idea...
>
> I would have one plugin which implements HtmlParseFilter, and I would
> have to check the current URL by calling content.getUrl(). This would all
> happen in the same class, so I would have to add switch cases. I could
> put the XPath expression for each site in a separate file, but to get the
> XPath expression I would have to decide which file to read, and for that I
> would have to put this logic in a switch case...
>
> Please correct me if I am getting this all wrong!
>
> And I think getting custom data from a page is a common requirement for
> web crawling solutions. Aren't there any such Nutch plugins available on
> the web?
>
> Thanks,
> Tony.
>
>
> On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
>
> > Hi,
> >
> > Yes, you should write a plugin that has a parse filter and an indexing
> > filter. To ease maintenance you would want to have a file per host/domain
> > containing XPath expressions, which is far easier than switch statements
> > that need to be recompiled. The indexing filter would then index the
> > field values extracted by your parse filter.
> >
> > Cheers,
> > Markus
> >
> > -----Original message-----
> > > From: Tony Mullins <tonymullins...@gmail.com>
> > > Sent: Tue 11-Jun-2013 16:07
> > > To: user@nutch.apache.org
> > > Subject: Data Extraction from 100+ different sites...
> > >
> > > Hi,
> > >
> > > > I have 100+ different sites (and more may be added in the near
> > > > future). I have to crawl them and extract my required information
> > > > from each site, so each site would have its own extraction rules
> > > > (XPaths).
> > > >
> > > > So far I have seen no built-in mechanism in Nutch that fulfills my
> > > > requirement, so I may have to write a custom HtmlParseFilter
> > > > extension and an IndexingFilter plugin.
> > > >
> > > > And I may have to write 100+ switch cases in my plugin to handle the
> > > > extraction rules of each site...
> > > >
> > > > Is this the best way to handle my requirement, or is there a better
> > > > way to handle it?
> > >
> > > Thanks for your support & help.
> > >
> > > Tony.
> > >
> >
>
