[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471967
 ] 

nutch.newbie commented on NUTCH-444:
------------------------------------

Well, let's try this again in terms of feedparser.

I completely disagree with the idea that a dormant project, one that doesn't 
support newer protocols and has shown no activity in the last 12 months, is not 
a reason for change. Let us just focus on the publicly available stats from 
syndic8.com (they don't have all the feeds, but they have enough data to show 
the big picture):

http://www.syndic8.com/stats.php?Section=feeds#tabtable

Total Feeds:            495,614
Atom Feeds:             84,746
RSS Feeds:              397,565
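
As a quick sanity check on the figures above, the share of each format can be 
computed directly. This is only a minimal sketch in Java: the class and method 
names are made up for illustration, and the three numbers are the syndic8.com 
figures listed above.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
public class FeedShare {

    // Returns `part` as a percentage of `total`.
    static double percent(long part, long total) {
        return 100.0 * part / total;
    }

    public static void main(String[] args) {
        long total = 495_614; // Total Feeds
        long atom = 84_746;   // Atom Feeds
        long rss = 397_565;   // RSS Feeds

        System.out.printf(Locale.ROOT, "Atom: %.1f%%%n", percent(atom, total));
        System.out.printf(Locale.ROOT, "RSS:  %.1f%%%n", percent(rss, total));
    }
}
```

With these inputs it prints about 17.1% for Atom and 80.2% for RSS.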

Roughly 17% of these feeds (84,746 of 495,614) are Atom feeds. So the "Nutch 
default installation" misses about one sixth of the "feed web". Imagine having 
a search engine site that can only handle HTML 3.0 and nothing more because the 
project that developed the great HTML 3.0 lib is no longer active. Now you may 
say: well, that's HTML, it's a different issue.

Well, blogs and feeds are growing on trees and we can't afford to miss a sixth 
of the blogs/feeds.

So is that a good reason to still stick with commons feedparser? 

Cheers

> Possibly use a different library to parse RSS feed for improved performance 
> and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??
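
To make the "custom implementation based on Stax" alternative a bit more 
concrete, here is a rough sketch of what streaming a feed could look like. It 
is not the attached parse-feed code and not an existing Nutch API: the class 
and method names are hypothetical, and it assumes titles contain plain text 
only. It uses the standard javax.xml.stream API and treats RSS <item> and Atom 
<entry> elements the same way, without building a whole-document tree first.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of Nutch or feedparser.
public class StreamingFeedTitles {

    // RSS uses <item>, Atom uses <entry>; match on the local name only.
    private static boolean isEntry(String localName) {
        return localName.equals("item") || localName.equals("entry");
    }

    // Collects the text of every <title> found inside an <item> or <entry>,
    // reading the feed as a stream of events.
    // Assumption: titles hold plain text (no nested markup).
    static List<String> entryTitles(Reader in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds come from untrusted sites, so do not process DTDs.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        List<String> titles = new ArrayList<>();
        boolean inEntry = false;
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (isEntry(name)) {
                        inEntry = true;
                    } else if (inEntry && name.equals("title")) {
                        // Reads the text and leaves the reader on </title>.
                        titles.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && isEntry(reader.getLocalName())) {
                    inEntry = false;
                }
            }
        } finally {
            reader.close();
        }
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        String rss = "<rss version=\"2.0\"><channel><title>Site</title>"
                + "<item><title>First post</title></item></channel></rss>";
        String atom = "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
                + "<title>Site</title>"
                + "<entry><title>Second post</title></entry></feed>";
        System.out.println(entryTitles(new StringReader(rss)));
        System.out.println(entryTitles(new StringReader(atom)));
    }
}
```

Only the current event and the collected titles are held in memory, which is 
the property that matters for the OutOfMemory problem described above. A real 
plugin would also need links, dates, and content, and would have to cope with 
markup inside titles.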

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
