Hi Harsh,

On Wed, Mar 30, 2016 at 5:18 AM, <user-digest-h...@nutch.apache.org> wrote:
> From: harsh <harsh.sha...@orkash.com>
> To: user@nutch.apache.org
> Cc:
> Date: Tue, 29 Mar 2016 09:30:07 +0530
> Subject: Re: Get all the feed metadata
>
> Hi Lewis
>
> seedurl.txt file is as follows
>
> http://timesofindia.feedsportal.com/c/33039/f/533917/index.rss
> http://timesofindia.feedsportal.com/c/33039/f/533922/index.rss
> http://www.thehindu.com/news/cities/Delhi/?service=rss
> http://www.thehindu.com/news/international/?service=rss
> http://indianexpress.com/section/sports/cricket/feed/
> http://indianexpress.com/section/sports/feed/
> http://news.google.co.in/news?cf=all&hl=en&pz=1&ned=in&output=rss
>
> While executing parse phase, all the URLs extracting form the rss-feeds
> are kept as out_links with corresponding title. but fail to extract and
> store pub_date,author etc for each URL(through ROME API which is already
> used in nutch).
>

I tried the first link you've posted above. When using the 'feed' plugin I get the following:

lmcgibbn@LMC-032857 /usr/local/nutch(master) $ ./runtime/local/bin/nutch parsechecker -dumpText " http://timesofindia.feedsportal.com/c/33039/f/533917/index.rss"
fetching: http://timesofindia.feedsportal.com/c/33039/f/533917/index.rss
robots.txt whitelist not configured.
parsing: http://timesofindia.feedsportal.com/c/33039/f/533917/index.rss
contentType: application/rss+xml
signature: 7092ca36133abb3961b70d355f28762a
--------- Url ---------------
http://timesofindia.feedsportal.com/c/33039/f/533917/s/4e98636c/sc/24/l/0Ltimesofindia0Bindiatimes0N0Cworld0Crest0Eof0Eworld0CWhy0Ewar0Eis0Ebig0Ebusiness0Carticleshow0C516113640Bcms/story01.htm
--------- ParseData ---------
Version: 5
Status: failed(2,0)
Title: Why war is big business
Outlinks: 0
Content Metadata: nutch.fetch.time=1459348679904 ETag=1459340949000 Date=Wed, 30 Mar 2016 14:33:25 GMT Content-Length=6905 nutch.crawl.score=0.0 Last-Modified=Wed, 30 Mar 2016 12:29:09 GMT Content-Encoding=gzip Set-Cookie=MF2=scnkvxe995mb; domain=.feedsportal.com; expires=Fri, 30-Mar-18 14:33:25 GMT; path=/ Connection=close Server=FeedsPortal
Parse Metadata: feed= http://timesofindia.indiatimes.com/articlelist/296589292.cms published=1459318397000
--------- ParseText ---------
<a href=" http://timesofindia.indiatimes.com/world/rest-of-world/Why-war-is-big-business/articleshow/51611364.cms"><img border="0" hspace="10" align="left" style="margin-top:3px;margin-right:5px;" src=" http://timesofindia.indiatimes.com/photo/51611364.cms" /></a>A quick look at the revenues of the world’s top arms sellers tells a tragic story - their earnings are higher than the GDP of 140 countries.
Worse, global defence spending continues to rise.<br clear='all'/><br/><br/><a href=" http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/1/rc.htm" rel="nofollow"><img src=" http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/1/rc.img" border="0"/></a><br/><br/><a href=" http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/2/rc.htm" rel="nofollow"><img src=" http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/2/rc.img" border="0"/></a><br/><br/><a href=" http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/3/rc.htm" rel="nofollow"><img src=" http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/3/rc.img" border="0"/></a><br/><br/><a href=" http://da.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/a2.htm"><img src=" http://da.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/a2.img" border="0"/></a><img width="1" height="1" src=" http://pi.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/a2t.img" border="0"/><img width='1' height='1' src=' http://timesofindia.feedsportal.com/c/33039/f/533917/s/4e98636c/sc/24/mf.gif' border='0'/>

.... many more entries from the feed are then parsed as well!

The field you are always interested in is 'Parse Metadata:'. As you can see here, it gives you the feed and the published date; the title is listed under 'ParseData'. In all honesty, the only XML tags I can see are <title>, <description> and <pubDate>, so this seems absolutely fine to me. If <author> is not encoded in the feed XML, then it is not possible to extract it.

hth
Lewis
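
PS: a quick way to check which tags a feed item actually carries, and to read the 'published' value, is sketched below in Python. It is only a rough illustration, not how Nutch or ROME do it. SAMPLE_RSS is a made-up item (not the real Times of India feed), item_tags and published_to_utc are hypothetical helper names, and treating 'published' as milliseconds since the Unix epoch is an assumption based on its 13 digits.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical RSS item for illustration only; not copied from the real feed.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Why war is big business</title>
      <description>Summary text</description>
      <pubDate>Tue, 29 Mar 2016 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def item_tags(rss_text):
    """Return the sorted child tag names of every <item> in the feed."""
    root = ET.fromstring(rss_text)
    return [sorted(child.tag for child in item) for item in root.iter("item")]


def published_to_utc(millis):
    """Convert a value assumed to be epoch milliseconds to a UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # No 'author' appears here because the sample item has no <author> tag.
    print(item_tags(SAMPLE_RSS))  # [['description', 'pubDate', 'title']]
    # The 'published' value from the Parse Metadata shown above.
    print(published_to_utc(1459318397000).isoformat())  # 2016-03-30T06:13:17+00:00
```

If 'author' is missing from the list printed for your own feed, no parser can return it for that item.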