[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505964 ]
nutch.newbie commented on NUTCH-444:
------------------------------------

Thanks Chris and Dogacan. Glad to see things are moving forward. A couple of issues:

1. I must have made some configuration error, because I still have to make two trips rather than one when parsing a feed. What I am trying to achieve: for feed URLs like the following

http://rss.cnn.com/rss/cnn_topstories.rss

I just want to crawl once and collect all the items as individual Lucene documents, i.e. if the feed has 25 news items, I would like to get 25 search results when I try the query "cnn". This is probably my own config error, since I haven't been playing around with Nutch 0.9.

2. A feed URL that doesn't end with .rss or .atom etc. gets picked up by the HTML parser rather than parse-feed. All my seed URLs are feed URLs and they work great in Safari. How do I fix this? Suggestions?

3. A lot of my feed providers provide feeds in a non-standard way. It would be nice if I could decide which elements I would like to parse, and what the default should be if the author name or author email is missing from the provider info.

Don't know if this helps in any way; I'm just providing my findings after running a couple of tests. I will do some more tests during the weekend.

Thanks again for all the help.

> Possibly use a different library to parse RSS feed for improved performance
> and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: feed.tar.bz2, NUTCH-444.Mattmann.061707.patch.txt,
> NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
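
On point 2 of the comment above (feed URLs without a .rss or .atom extension being handled by the HTML parser): one general way to recognise a feed independently of its URL is to look at the root element of the fetched content. The following is a minimal standalone sketch of that idea using the JDK's StAX API. It is not Nutch's plugin-selection mechanism and not part of any attached patch; the class name FeedSniffer and the method looksLikeFeed are hypothetical, and the list of root element names (rss, feed, RDF) is an assumption that covers the common RSS 2.0, Atom and RSS 1.0 cases only.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: guesses whether content is a feed from its root
// element instead of from the URL extension.
class FeedSniffer {

    // Returns true if the first XML element is named rss, feed or RDF.
    static boolean looksLikeFeed(InputStream in) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process DTDs or external entities in untrusted content.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        // Only the root element is inspected; the rest of
                        // the document is never read.
                        String root = reader.getLocalName();
                        return root.equals("rss")
                                || root.equals("feed")
                                || root.equals("RDF");
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML (for example plain text): not a feed.
        }
        return false;
    }

    // Convenience overload for content that is already in memory.
    static boolean looksLikeFeed(String content) {
        return looksLikeFeed(
                new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        System.out.println(looksLikeFeed("<rss version=\"2.0\"><channel/></rss>"));
        System.out.println(looksLikeFeed("<html><body>news</body></html>"));
    }
}
```

Because the check stops at the first start tag, it does not need to load the whole feed into memory. A root element named feed outside the Atom namespace would also match, so a stricter version could additionally compare reader.getNamespaceURI() against the Atom namespace.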