[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

nutch.newbie (JIRA) Mon, 18 Jun 2007 14:59:47 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505964
 ]


nutch.newbie commented on NUTCH-444:
------------------------------------

Thanks Chris and Dogacan. Glad to see things are moving forward. A couple of 
issues:

1. I must have made a configuration error, because I still have to make two 
trips rather than one when parsing a feed. For a feed URL like 
http://rss.cnn.com/rss/cnn_topstories.rss, I want to crawl just once and 
collect all the items as individual Lucene documents. That is, if the feed has 
25 news items, I would like to get 25 search results for the query "cnn". This 
is probably my own config error, because I haven't been playing around with 
Nutch 0.9.

2. Feed URLs that don't end with .rss, .atom, etc. get picked up by the HTML 
parser rather than parse-feed. All my seed URLs are feed URLs, and they work 
great in Safari. How do I fix this? Any suggestions?

3. A lot of my feed providers publish feeds in a non-standard way. It would be 
nice if I could decide which elements to parse, and what the default should be 
if the author name or author email is missing from the provider info.
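
To make points 1 to 3 concrete, here is a rough sketch of the behaviour I am 
after. It is plain Python using only the standard library, not Nutch or 
parse-feed code. The names looks_like_feed, feed_to_documents and DEFAULTS, the 
default values, and the sample feed are all made up for illustration. The idea: 
detect a feed by its root element instead of the URL suffix, emit one record 
per item from a single fetch, and fall back to a configurable default when a 
field such as the author is missing.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-field defaults used when a provider omits an element (point 3).
DEFAULTS = {"title": "(untitled)", "link": "", "author": "unknown"}


def local_name(tag):
    """Strip an XML namespace such as '{ns}feed' down to 'feed'."""
    return tag.rsplit("}", 1)[-1]


def looks_like_feed(content):
    """Point 2: decide by the root element, not by the URL suffix."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        return False  # not well-formed XML, e.g. ordinary HTML
    # RSS 2.0 -> <rss>, Atom -> <feed>, RSS 1.0 -> <rdf:RDF>
    return local_name(root.tag) in ("rss", "feed", "RDF")


def text_of(item, name):
    """Return the value of the first non-empty child called `name`, else the default."""
    for child in item:
        if local_name(child.tag) != name:
            continue
        # Atom <link href="..."/> keeps the URL in an attribute.
        value = (child.text or "").strip() or child.get("href", "")
        # Atom <author><name>...</name></author> nests the text one level down.
        if not value and len(child):
            value = " ".join(t.strip() for t in child.itertext() if t.strip())
        if value:
            return value
    return DEFAULTS[name]


def feed_to_documents(content):
    """Point 1: one fetch of the feed yields one record per item/entry."""
    root = ET.fromstring(content)
    docs = []
    for elem in root.iter():
        if local_name(elem.tag) in ("item", "entry"):
            docs.append({name: text_of(elem, name) for name in DEFAULTS})
    return docs


# Made-up sample feed; the second item has no <author>.
SAMPLE = """<rss version="2.0"><channel><title>Example</title>
<item><title>First</title><link>http://example.com/1</link>
<author>a@example.com</author></item>
<item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""

if __name__ == "__main__":
    if looks_like_feed(SAMPLE):
        for doc in feed_to_documents(SAMPLE):
            print(doc)
```

With a feed of 25 items, feed_to_documents would return 25 records, which is 
the "25 search results" case from point 1.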

Don't know if this helps in any way; I am just providing my findings after 
running a couple of tests. I will do some more tests during the weekend.

Thanks again for all the help.


> Possibly use a different library to parse RSS feed for improved performance 
> and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: feed.tar.bz2, NUTCH-444.Mattmann.061707.patch.txt, 
> NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
