What I usually do in cases like these is to propagate an identifier from
the seeds and use it in the HTMLParsers to determine whether they should
process a page. See the urlmeta plugin for the config to propagate a
metadatum from the seeds. This way you don't need to act on URL patterns,
but you do need one HTMLParser per key. Since you have loads of different
extraction rules, you might be better off following the other suggestions
in this thread.
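For reference, a minimal sketch of that setup in Nutch 1.x; the `site-id`
tag name and the seed URL are made up for illustration, and you should
append `urlmeta` to whatever `plugin.includes` value you already have:

```
<!-- nutch-site.xml: enable urlmeta and declare which tag to propagate -->
<property>
  <name>plugin.includes</name>
  <value>...|urlmeta|...</value>  <!-- append urlmeta to your existing value -->
</property>
<property>
  <name>urlmeta.tags</name>
  <value>site-id</value>
</property>
```

```
# seeds.txt: tab-separated key=value metadata after each URL
http://www.example-movies.com/	site-id=moviereviews
```

The parse filter can then read `site-id` from the parse metadata of every
page crawled from that seed and decide which extraction rules apply.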

Julien


On 11 June 2013 18:45, Tony Mullins <tonymullins...@gmail.com> wrote:

> I have to crawl the sub-links of these sites as well, identify the
> pattern of those sub-links' HTML layout, and extract my required data.
> One example could be a movie review site: every page of such a site
> would (ideally) have the same HTML layout describing a particular movie,
> and I have to extract the info from that page.
>
>
> And for this requirement I am relying on Nutch + an HtmlParse plugin!
>
>
>
> On Tue, Jun 11, 2013 at 10:34 PM, AC Nutch <acnu...@gmail.com> wrote:
>
> > I'm a bit confused about where the requirement to *crawl* these sites
> > comes into it. From what you're saying, it looks like you're just
> > talking about parsing the contents of a list of sites that you're
> > trying to extract data from. In which case there's not much of a use
> > case for Nutch... or am I confused?
> >
> >
> > On Tue, Jun 11, 2013 at 1:26 PM, Tony Mullins
> > <tonymullins...@gmail.com> wrote:
> >
> > > Yes, all the web pages will have different HTML structures/layouts,
> > > and I would have to identify/define an XPath expression for each one
> > > of them.
> > >
> > > But I am trying to come up with a generic output format for these
> > > XPath expressions, so whatever the XPath expression is, I want to
> > > have the result in, let's say, Field A, Field B, Field C. In some
> > > cases some of these fields could be blank as well. That way I could
> > > map them to my Solr schema properly.
> > >
> > > In this regard I was hoping to get some help or guidelines from your
> > > past experiences.
> > >
> > > Thanks,
> > > Tony
> > >
> > >
> > > On Tue, Jun 11, 2013 at 10:09 PM, AC Nutch <acnu...@gmail.com> wrote:
> > >
> > > > Hi Tony,
> > > >
> > > > So if I understand correctly, you have 100+ web pages, each with a
> > > > totally different format, that you're trying to extract
> > > > separate/unrelated pieces of information from. If there's no
> > > > connection between any of the web pages or any of the pieces of
> > > > information that you're trying to extract, then it's pretty much
> > > > unavoidable to provide separate identifiers and cases for finding
> > > > each one. Markus' suggestion, I believe, is to just have a
> > > > "dictionary" file with the URL as the key and the XPath expression
> > > > for the info that you want as the value. No matter what
> > > > crawling/parsing platform you're using, a solution of that sort is
> > > > pretty much unavoidable with the assumptions given.
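The "dictionary" approach can be sketched in plain Java. The hostnames,
XPath expressions, and the `RULES` map below are invented for
illustration; a real Nutch parse filter would load the rules from
per-host files rather than hard-coding them:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class DictionaryExtractor {
    // Hypothetical per-host dictionary: host -> XPath for the field we
    // want. In practice this would come from a file per host/domain so
    // rules can change without recompiling.
    static final Map<String, String> RULES = Map.of(
        "www.example-movies.com", "//div[@id='review-title']",
        "www.example-shop.com",   "//span[@class='price']");

    static String extract(String host, String html) throws Exception {
        String expr = RULES.get(host);
        if (expr == null) {
            return "";                     // unknown host -> blank field
        }
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(
                html.getBytes(StandardCharsets.UTF_8)));
        // Evaluate the host's XPath and return the matched text content.
        return XPathFactory.newInstance().newXPath().evaluate(expr, doc);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id='review-title'>"
            + "Great Movie</div></body></html>";
        System.out.println(extract("www.example-movies.com", page));
    }
}
```

Note this uses an XML parser for brevity, so it only handles well-formed
markup; against real-world HTML you would plug in an HTML-tolerant parser
such as the one Nutch already uses.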
> > > >
> > > > That being said, is there any common form that the data you're
> > > > trying to extract from these pages follows? Is there a regex that
> > > > could match it, or anything else that might identify it in a
> > > > common way?
> > > >
> > > > Alex
> > > >
> > > >
> > > > On Tue, Jun 11, 2013 at 12:59 PM, Tony Mullins
> > > > <tonymullins...@gmail.com> wrote:
> > > >
> > > > > Hi Markus,
> > > > >
> > > > > I couldn't understand how I can avoid switch cases in your
> > > > > suggested idea.
> > > > >
> > > > > I would have one plugin which will implement HtmlParseFilter, and
> > > > > I would have to check the current URL by calling
> > > > > content.getUrl(), and this all will be happening in the same
> > > > > class, so I would have to add switch cases. I could add the XPath
> > > > > expressions for each site in separate files, but to get an XPath
> > > > > expression I would have to decide which file to read, and for
> > > > > that I would have to put this logic in a switch case.
> > > > >
> > > > > Please correct me if I am getting this all wrong!
> > > > >
> > > > > And I think this is a common requirement for web crawling
> > > > > solutions, to get custom data from pages... so aren't there any
> > > > > such Nutch plugins available on the web?
> > > > >
> > > > > Thanks,
> > > > > Tony.
> > > > >
> > > > >
> > > > > On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma
> > > > > <markus.jel...@openindex.io> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Yes, you should write a plugin that has a parse filter and an
> > > > > > indexing filter. To ease maintenance you would want to have a
> > > > > > file per host/domain containing XPath expressions, which is far
> > > > > > easier than switch statements that need to be recompiled. The
> > > > > > indexing filter would then index the field values extracted by
> > > > > > your parse filter.
> > > > > >
> > > > > > Cheers,
> > > > > > Markus
> > > > > >
> > > > > > -----Original message-----
> > > > > > > From: Tony Mullins <tonymullins...@gmail.com>
> > > > > > > Sent: Tue 11-Jun-2013 16:07
> > > > > > > To: user@nutch.apache.org
> > > > > > > Subject: Data Extraction from 100+ different sites...
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I have 100+ different sites (and maybe more will be added in
> > > > > > > the near future). I have to crawl them and extract my
> > > > > > > required information from each site, so each site would have
> > > > > > > its own extraction rules (XPaths).
> > > > > > >
> > > > > > > So far I have seen there is no built-in mechanism in Nutch to
> > > > > > > fulfill my requirement, and I may have to write a custom
> > > > > > > HtmlParseFilter extension and an IndexingFilter plugin.
> > > > > > >
> > > > > > > And I may have to write 100+ switch cases in my plugin to
> > > > > > > handle the extraction rules of each site.
> > > > > > >
> > > > > > > Is this the best way to handle my requirement, or is there a
> > > > > > > better way to handle it?
> > > > > > >
> > > > > > > Thanks for your support & help.
> > > > > > >
> > > > > > > Tony.
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
