I have to crawl the sub-links of these sites as well, identify the pattern of those sub-links' HTML layout, and extract my required data. One example could be a movie review site: every page of that site would (ideally) have the same HTML layout describing a particular movie, and I have to extract the info from each such page.
And for this requirement I am relying on Nutch + the HtmlParse plugin!

On Tue, Jun 11, 2013 at 10:34 PM, AC Nutch <acnu...@gmail.com> wrote:

> I'm a bit confused about where the requirement to *crawl* these sites comes
> into it. From what you're saying, it looks like you're just talking about
> parsing the contents of a list of sites that you're trying to extract data
> from, in which case there's not much of a use case for Nutch... or am I
> confused?
>
>
> On Tue, Jun 11, 2013 at 1:26 PM, Tony Mullins <tonymullins...@gmail.com> wrote:
>
> > Yes, all the web pages will have a different HTML structure/layout, and I
> > would have to identify/define an XPath expression for each one of them.
> >
> > But I am trying to come up with a generic output format for these XPath
> > expressions: whatever the XPath expression is, I want the result to land
> > in, say, Field A, Field B, Field C. In some cases some of these fields
> > could be blank as well. That way I could map them to my Solr schema
> > properly.
> >
> > In this regard I was hoping to get some help or guidelines from your past
> > experiences...
> >
> > Thanks,
> > Tony
> >
> >
> > On Tue, Jun 11, 2013 at 10:09 PM, AC Nutch <acnu...@gmail.com> wrote:
> >
> > > Hi Tony,
> > >
> > > So if I understand correctly, you have 100+ web pages, each with a
> > > totally different format, that you're trying to extract
> > > separate/unrelated pieces of information from. If there's no connection
> > > between any of the web pages or any of the pieces of information that
> > > you're trying to extract, then it's pretty much unavoidable to have to
> > > provide separate identifiers and cases for finding each one. Markus'
> > > suggestion, I believe, is to just have a "dictionary" file with the URL
> > > as the key and the XPath expression for the info that you want as the
> > > value. No matter what crawling/parsing platform you're using, a
> > > solution of that sort is pretty much unavoidable given those
> > > assumptions.
> > >
> > > That being said, is there any common form that the data you're trying
> > > to extract from these pages follows? Is there a regex that could match
> > > it, or anything else that might identify it in a common way?
> > >
> > > Alex
> > >
> > >
> > > On Tue, Jun 11, 2013 at 12:59 PM, Tony Mullins <tonymullins...@gmail.com> wrote:
> > >
> > > > Hi Markus,
> > > >
> > > > I couldn't understand how I can avoid switch cases in your suggested
> > > > idea...
> > > >
> > > > I would have one plugin which implements HtmlParseFilter, and I would
> > > > have to check the current URL by calling content.getUrl(). This would
> > > > all be happening in the same class, so I would have to add switch
> > > > cases... I might be able to put the XPath expression for each site in
> > > > separate files, but to get the XPath expression I would have to decide
> > > > which file to read, and for that I would have to put this logic in a
> > > > switch case...
> > > >
> > > > Please correct me if I am getting this all wrong!
> > > >
> > > > And I think this is a common requirement for web crawling solutions,
> > > > getting custom data from a page... so aren't there any such Nutch
> > > > plugins already available on the web?
> > > >
> > > > Thanks,
> > > > Tony.
> > > >
> > > >
> > > > On Tue, Jun 11, 2013 at 7:35 PM, Markus Jelsma
> > > > <markus.jel...@openindex.io> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > Yes, you should write a plugin that has a parse filter and an
> > > > > indexing filter. To ease maintenance you would want to have a file
> > > > > per host/domain containing XPath expressions, far easier than
> > > > > switch statements that need to be recompiled. The indexing filter
> > > > > would then index the field values extracted by your parse filter.
> > > > >
> > > > > Cheers,
> > > > > Markus
> > > > >
> > > > > -----Original message-----
> > > > > > From: Tony Mullins <tonymullins...@gmail.com>
> > > > > > Sent: Tue 11-Jun-2013 16:07
> > > > > > To: user@nutch.apache.org
> > > > > > Subject: Data Extraction from 100+ different sites...
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have 100+ different sites (and more may be added in the near
> > > > > > future). I have to crawl them and extract my required information
> > > > > > from each site, so each site would have its own extraction rules
> > > > > > (XPaths).
> > > > > >
> > > > > > So far I have seen there is no built-in mechanism in Nutch to
> > > > > > fulfill my requirement, and I may have to write a custom
> > > > > > HtmlParseFilter extension and an IndexingFilter plugin.
> > > > > >
> > > > > > And I may have to write 100+ switch cases in my plugin to handle
> > > > > > the extraction rules of each site...
> > > > > >
> > > > > > Is this the best way to handle my requirement, or is there a
> > > > > > better way to handle it?
> > > > > >
> > > > > > Thanks for your support & help.
> > > > > >
> > > > > > Tony.
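
[Editor's note] To make the per-host "dictionary" idea from the thread concrete, here is a minimal, self-contained Java sketch; it is not from the original thread. It assumes a hypothetical properties file named xpath-rules.properties that maps host names to XPath expressions, and it uses only the JDK's javax.xml.xpath API. Inside an actual Nutch HtmlParseFilter the same lookup would run against the DOM handed to the parse filter, with the extracted values stored as parse metadata for an indexing filter to map onto Solr fields.

// Sketch only (not from the thread): per-host XPath lookup driven by a rules file.
// The file name "xpath-rules.properties" and the sample host/expression are made up.
// Example rule file line:
//   www.example-movies.com=//div[@id='review']/h1/text()

import java.io.FileInputStream;
import java.io.IOException;
import java.net.URL;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class PerHostXPathExtractor {

    private final Properties rules = new Properties();
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    public PerHostXPathExtractor(String rulesFile) throws IOException {
        // One line per host: host=XPath. Adding a new site means adding a line
        // to this file, not recompiling the plugin.
        try (FileInputStream in = new FileInputStream(rulesFile)) {
            rules.load(in);
        }
    }

    /** Returns the extracted value for this page, or null if no rule exists for its host. */
    public String extract(String pageUrl, Document dom) throws Exception {
        String host = new URL(pageUrl).getHost();
        String expression = rules.getProperty(host);
        if (expression == null) {
            return null; // unknown host: this page gets no custom field
        }
        return xpath.evaluate(expression, dom);
    }

    public static void main(String[] args) throws Exception {
        PerHostXPathExtractor extractor = new PerHostXPathExtractor("xpath-rules.properties");
        // Stand-in for the DOM Nutch hands to a parse filter; a strict XML parse
        // only works on well-formed markup, so this is just for demonstration.
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("movie-page.html");
        System.out.println(extractor.extract("http://www.example-movies.com/some-movie", dom));
    }
}

This is exactly the maintenance win Markus describes: supporting a new site is a one-line change in the rules file. The extracted values would then be written to generic fields (the Field A / Field B / Field C Tony mentions) by the indexing filter, so the Solr schema stays the same regardless of which site a page came from.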