Re: Get all the feed metadata

Lewis John Mcgibbney Wed, 30 Mar 2016 07:49:28 -0700

Hi harsh,

On Wed, Mar 30, 2016 at 5:18 AM, <user-digest-h...@nutch.apache.org> wrote:


> From: harsh <harsh.sha...@orkash.com>
> To: user@nutch.apache.org
> Cc:
> Date: Tue, 29 Mar 2016 09:30:07 +0530
> Subject: Re: Get all the feed metadata
> Hi Lewis
>
> seedurl.txt file is as follows
>
> http://timesofindia.feedsportal.com/c/33039/f/533917/index.rss
> http://timesofindia.feedsportal.com/c/33039/f/533922/index.rss
> http://www.thehindu.com/news/cities/Delhi/?service=rss
> http://www.thehindu.com/news/international/?service=rss
> http://indianexpress.com/section/sports/cricket/feed/
> http://indianexpress.com/section/sports/feed/
> http://news.google.co.in/news?cf=all&hl=en&pz=1&ned=in&output=rss
>
> While executing the parse phase, all the URLs extracted from the RSS feeds
> are kept as out_links with their corresponding titles, but Nutch fails to
> extract and store pub_date, author, etc. for each URL (through the ROME API,
> which is already used in Nutch).
>
>
I tried the first link you posted above. Using the 'feed' plugin, I get the
following:

lmcgibbn@LMC-032857 /usr/local/nutch(master) $ ./runtime/local/bin/nutch parsechecker -dumpText "http://timesofindia.feedsportal.com/c/33039/f/533917/index.rss"
fetching: http://timesofindia.feedsportal.com/c/33039/f/533917/index.rss
robots.txt whitelist not configured.
parsing: http://timesofindia.feedsportal.com/c/33039/f/533917/index.rss
contentType: application/rss+xml
signature: 7092ca36133abb3961b70d355f28762a
---------
Url
---------------

http://timesofindia.feedsportal.com/c/33039/f/533917/s/4e98636c/sc/24/l/0Ltimesofindia0Bindiatimes0N0Cworld0Crest0Eof0Eworld0CWhy0Ewar0Eis0Ebig0Ebusiness0Carticleshow0C516113640Bcms/story01.htm
---------
ParseData
---------

Version: 5
Status: failed(2,0)
Title: Why war is big business
Outlinks: 0
Content Metadata: nutch.fetch.time=1459348679904 ETag=1459340949000
Date=Wed, 30 Mar 2016 14:33:25 GMT Content-Length=6905
nutch.crawl.score=0.0 Last-Modified=Wed, 30 Mar 2016 12:29:09 GMT
Content-Encoding=gzip Set-Cookie=MF2=scnkvxe995mb; domain=.feedsportal.com;
expires=Fri, 30-Mar-18 14:33:25 GMT; path=/ Connection=close
Server=FeedsPortal
Parse Metadata: feed=
http://timesofindia.indiatimes.com/articlelist/296589292.cms
published=1459318397000
---------
ParseText
---------

<a href="
http://timesofindia.indiatimes.com/world/rest-of-world/Why-war-is-big-business/articleshow/51611364.cms"><img
border="0" hspace="10" align="left"
style="margin-top:3px;margin-right:5px;" src="
http://timesofindia.indiatimes.com/photo/51611364.cms" /></a>A quick look
at the revenues of the world’s top arms sellers tells a tragic story -
their earnings are higher than the GDP of 140 countries. Worse, global
defence spending continues to rise.<br clear='all'/><br/><br/><a href="
http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/1/rc.htm"
rel="nofollow"><img src="
http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/1/rc.img"
border="0"/></a><br/><br/><a href="
http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/2/rc.htm"
rel="nofollow"><img src="
http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/2/rc.img"
border="0"/></a><br/><br/><a href="
http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/3/rc.htm"
rel="nofollow"><img src="
http://rc.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/rc/3/rc.img"
border="0"/></a><br/><br/><a href="
http://da.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/a2.htm"><img
src="
http://da.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/a2.img"
border="0"/></a><img width="1" height="1" src="
http://pi.feedsportal.com/r/247396247002/u/0/f/533917/c/33039/s/4e98636c/sc/24/a2t.img"
border="0"/><img width='1' height='1' src='
http://timesofindia.feedsportal.com/c/33039/f/533917/s/4e98636c/sc/24/mf.gif'
border='0'/>
....
Many more feed entries are then parsed as well!
The field you are interested in is 'Parse Metadata:'. As you can see
here, it gives you the feed and the published date. In all honesty, the
only XML tags I can see are <title>, <description> and <pubDate>, so this
seems absolutely fine to me.
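
For reference, the 'published' value above looks like epoch milliseconds.
That is my assumption, based on it having the same shape as
nutch.fetch.time. A minimal Python sketch (the helper name
epoch_millis_to_utc is made up for illustration) to turn it into a readable
date:

```python
from datetime import datetime, timezone


def epoch_millis_to_utc(millis: str) -> datetime:
    """Convert an epoch-milliseconds string to a timezone-aware UTC datetime.

    Assumption: the value is milliseconds since 1970-01-01T00:00:00Z.
    """
    return datetime.fromtimestamp(int(millis) / 1000, tz=timezone.utc)


# Value copied from the 'Parse Metadata:' line in the output above.
published = epoch_millis_to_utc("1459318397000")
print(published.isoformat())
```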
If <author> is not present in the feed XML, then it is not possible to extract it.
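
If you want to check this yourself, here is a rough Python sketch that
reports which of the tags each <item> actually carries. The sample XML is
made up for illustration (it is not the real Times of India feed), and
item_fields is a hypothetical helper, not part of Nutch or ROME:

```python
import xml.etree.ElementTree as ET

# Made-up sample: an item with <title>, <description> and <pubDate>,
# but no <author>, mirroring the situation described above.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Why war is big business</title>
      <description>A quick look at the revenues</description>
      <pubDate>Wed, 30 Mar 2016 06:13:17 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def item_fields(rss_text, wanted=("title", "description", "pubDate", "author")):
    """Return one dict per <item>; a tag missing from the XML maps to None."""
    root = ET.fromstring(rss_text)
    return [
        {tag: item.findtext(tag) for tag in wanted}
        for item in root.iter("item")
    ]


for fields in item_fields(SAMPLE_RSS):
    missing = [tag for tag, value in fields.items() if value is None]
    print(fields["title"], "-> missing:", missing)
```

Any tag listed as missing is simply not in the feed, so no parser can
recover it.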
hth
Lewis
