I have to crawl the sub-links of these sites as well, identify the pattern of those sub-links' HTML layout, and extract my required data. One example could be a movie review site: every page of that site would (ideally) have the same HTML layout describing a particular movie, and I have to extract the info from each such page.
And for this requirement I am relying on Nutch + the HtmlParse plugin!

On Tue, Jun 11, 2013 at 10:34 PM, AC Nutch <acnu...@gmail.com> wrote:

> I'm a bit confused about where the requirement to *crawl* these sites comes
> into it. From what you're saying, it looks like you're just talking about
> parsing the contents of a list of sites that you're trying to extract data
> from, in which case there's not much of a use case for Nutch... or am I
> confused?
>
>
> On Tue, Jun 11, 2013 at 1:26 PM, Tony Mullins <tonymullins...@gmail.com> wrote:
>
> > Yes, all the web pages will have a different HTML structure/layout, and I
> > would have to identify/define an XPath expression for each one of them.
> >
> > But I am trying to come up with a generic output format for these XPath
> > expressions: whatever the XPath expression is, I want the result to land
> > in, say, Field A, Field B, Field C. In some cases some of these fields
> > could be blank as well. That way I could map them to my Solr schema
> > properly.
> >
> > In this regard I was hoping to get some help or guidelines from your past
> > experiences...
> >
> > Thanks,
> > Tony
> >
> >
> > On Tue, Jun 11, 2013 at 10:09 PM, AC Nutch <acnu...@gmail.com> wrote:
> >
> > > Hi Tony,
> > >
> > > So if I understand correctly, you have 100+ web pages, each with a
> > > totally different format, that you're trying to extract
> > > separate/unrelated pieces of information from. If there's no connection
> > > between any of the web pages or any of the pieces of information that
> > > you're trying to extract, then it's pretty much unavoidable to have to
> > > provide separate identifiers and cases for finding each one. Markus'
> > > suggestion, I believe, is to just have a "dictionary" file with the URL
> > > as the key and the XPath expression for the info that you want as the
> > > value. No matter what crawling/parsing platform you're using, a
> > > solution of that sort is pretty much unavoidable given those
> > > assumptions.
> > >
> > > That being said, is there any common form that the data you're trying
> > > to extract from these pages follows? Is there a regex that could match
> > > it, or anything else that might identify it in a common way?
> > >
> > > Alex
> > >
> > >
> > > On Tue, Jun 11, 2013 at 12:59 PM, Tony Mullins <tonymullins...@gmail.com> wrote:
> > >
> > > > Hi Markus,
> > > >
> > > > I couldn't understand how I can avoid switch cases in your suggested
> > > > idea...
> > > >
> > > > I would have one plugin which implements HtmlParseFilter, and I would
> > > > have to check the current URL by calling content.getUrl(). This would
> > > > all be happening in the same class, so I would have to add switch
> > > > cases... I might be able to put the XPath expression for each site in
> > > > separate files, but to get the XPath expression I would have to decide
> > > > which file to read, and for that I would have to put this logic in a
> > > > switch case...
> > > >
> > > > Please correct me if I am getting this all wrong!
> > > >
> > > > And I think this is a common requirement for web crawling solutions,
> > > > getting custom data from a page... so aren't there any such Nutch
> > > > plugins already available on the web?
> > > >
> > > > Thanks,
> > > > Tony.
> > > >
> > > >
> > > > On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma
> > > > <markus.jel...@openindex.io> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Yes, you should write a plugin that has a parse filter and an
> > > > > indexing filter. To ease maintenance you would want to have a file
> > > > > per host/domain containing XPath expressions, far easier than
> > > > > switch statements that need to be recompiled. The indexing filter
> > > > > would then index the field values extracted by your parse filter.
> > > > >
> > > > > Cheers,
> > > > > Markus
> > > > >
> > > > > -----Original message-----
> > > > > > From: Tony Mullins <tonymullins...@gmail.com>
> > > > > > Sent: Tue 11-Jun-2013 16:07
> > > > > > To: user@nutch.apache.org
> > > > > > Subject: Data Extraction from 100+ different sites...
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have 100+ different sites (and more may be added in the near
> > > > > > future). I have to crawl them and extract my required information
> > > > > > from each site, so each site would have its own extraction rules
> > > > > > (XPaths).
> > > > > >
> > > > > > So far I have seen there is no built-in mechanism in Nutch to
> > > > > > fulfill my requirement, and I may have to write a custom
> > > > > > HtmlParseFilter extension and an IndexingFilter plugin.
> > > > > >
> > > > > > And I may have to write 100+ switch cases in my plugin to handle
> > > > > > the extraction rules of each site...
> > > > > >
> > > > > > Is this the best way to handle my requirement, or is there a
> > > > > > better way to handle it?
> > > > > >
> > > > > > Thanks for your support & help.
> > > > > >
> > > > > > Tony.
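
[Editor's note] To make the per-host "dictionary" idea from the thread concrete, here is a minimal, self-contained Java sketch; it is not from the original thread. It assumes a hypothetical properties file named xpath-rules.properties that maps host names to XPath expressions, and it uses only the JDK's javax.xml.xpath API. Inside an actual Nutch HtmlParseFilter the same lookup would run against the DOM handed to the parse filter, with the extracted values stored as parse metadata for an indexing filter to map onto Solr fields.

// Sketch only (not from the thread): per-host XPath lookup driven by a rules file.
// The file name "xpath-rules.properties" and the sample host/expression are made up.
// Example rule file line:
//   www.example-movies.com=//div[@id='review']/h1/text()

import java.io.FileInputStream;
import java.io.IOException;
import java.net.URL;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class PerHostXPathExtractor {

    private final Properties rules = new Properties();
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    public PerHostXPathExtractor(String rulesFile) throws IOException {
        // One line per host: host=XPath. Adding a new site means adding a line
        // to this file, not recompiling the plugin.
        try (FileInputStream in = new FileInputStream(rulesFile)) {
            rules.load(in);
        }
    }

    /** Returns the extracted value for this page, or null if no rule exists for its host. */
    public String extract(String pageUrl, Document dom) throws Exception {
        String host = new URL(pageUrl).getHost();
        String expression = rules.getProperty(host);
        if (expression == null) {
            return null; // unknown host: this page gets no custom field
        }
        return xpath.evaluate(expression, dom);
    }

    public static void main(String[] args) throws Exception {
        PerHostXPathExtractor extractor = new PerHostXPathExtractor("xpath-rules.properties");
        // Stand-in for the DOM Nutch hands to a parse filter; a strict XML parse
        // only works on well-formed markup, so this is just for demonstration.
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("movie-page.html");
        System.out.println(extractor.extract("http://www.example-movies.com/some-movie", dom));
    }
}

This is exactly the maintenance win Markus describes: supporting a new site is a one-line change in the rules file. The extracted values would then be written to generic fields (the Field A / Field B / Field C Tony mentions) by the indexing filter, so the Solr schema stays the same regardless of which site a page came from.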