[Nutch-dev] [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

nutch.newbie (JIRA) Sun, 11 Feb 2007 16:17:31 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472163 ]


nutch.newbie commented on NUTCH-444:
------------------------------------

Hi: 

I have now done my initial test run with 10,000+ feeds in 3 batches.

Batch 1
======
A total of 8000 feeds with URLs ending in ".rss", RSS feeds only. Works out 
of the box.

Batch 2
======
A total of 3000 Atom feeds with URLs ending in ".xml". Most of the time they 
throw an error during the dedup process. Sometimes they get parsed by 
parse-html.

Batch 3
======
A total of 2000 feeds ending with all kinds of extensions, for example .aspx, 
.php, .jsp, .ece and so on. These also throw errors just like batch 2.
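
Since the feeds in Batch 2 and 3 cannot be recognised by their extension, one 
option is to look at the content instead. Below is a minimal sketch, assuming 
the fetched bytes are already available as a string: it peeks at the root 
element with StAX and reports whether it looks like RSS or Atom 1.0. The class 
name FeedSniffer and the labels it returns are hypothetical, made up for 
illustration; they are not part of Nutch or of the attached parse-feed plugin, 
and this does not explain the dedup error itself.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not a Nutch class.
class FeedSniffer {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    /** Returns "rss", "atom" or "unknown" based on the root element only. */
    static String sniff(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Untrusted input: do not process DTDs or external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // The first START_ELEMENT event is the document's root element.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    String ns = reader.getNamespaceURI();
                    reader.close();
                    if ("rss".equals(name)) {
                        return "rss";
                    }
                    if ("feed".equals(name) && ATOM_NS.equals(ns)) {
                        return "atom";
                    }
                    return "unknown";
                }
            }
        } catch (XMLStreamException e) {
            // Not well-formed XML (for example tag-soup HTML): treat as unknown.
        }
        return "unknown";
    }
}
```

Only the first element is read, so the cost does not grow with the size of 
the feed. For example, FeedSniffer.sniff(content) on a page served from a 
.php or .aspx URL would still return "atom" if its root element is an Atom 
1.0 feed.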

Batch 2 and Batch 3 show the same bug as before. Note that I have run only 
1 round of fetch. One thing I am a bit confused about is the following. Let's 
say you have a feed with 5 items, i.e. 5 titles and 5 descriptions. Shouldn't 
a search such as url:feed.com return 6 results: 1 for the main feed page, 
which is the actual feed URL, and the other 5 for the 5 items? Currently I get 
only 1 search result, which is the feed URL.
Do I need to do 2 rounds of fetch? Because things are getting parsed correctly. 
Maybe it's because I don't have the indexing plugin, i.e. index-feed? I know 
we will work on it after NUTCH-443 is done, but I want to get a 
clarification, that's all :-) Cheers!
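
To make the "5 items, 6 results" expectation concrete, here is a minimal 
sketch of pulling one link per item out of an RSS document with StAX. The 
class name RssItemLister is hypothetical and only for illustration. Whether 
Nutch turns each of these links into its own search result depends on the 
parse and index plugins, which this sketch does not model.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not a Nutch class.
class RssItemLister {
    /** Returns the non-empty <link> text of every <item>, in document order. */
    static List<String> itemLinks(String rssXml) {
        List<String> links = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Untrusted input: do not process DTDs.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(rssXml));
            boolean inItem = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("item".equals(name)) {
                        inItem = true;
                    } else if (inItem && "link".equals(name)) {
                        // getElementText() consumes up to the matching end tag.
                        String link = reader.getElementText().trim();
                        if (!link.isEmpty()) {
                            links.add(link);
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "item".equals(reader.getLocalName())) {
                    inItem = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        return links;
    }
}
```

The channel-level link is skipped because it sits outside any item, so a feed 
with 5 items gives a list of 5 links. Together with the feed URL itself that 
would make the 6 documents I was expecting.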


Some log trace from Batch 1
===================
2007-02-12 00:55:23,607 DEBUG parse.ParseUtil - Parsing [http://rss.cnn.com/rss/cnn_marquee.rss] with [EMAIL PROTECTED]
2007-02-12 00:55:23,648 INFO  mapred.JobClient -  map 100% reduce 0%
2007-02-12 00:55:24,690 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-12 00:55:25,020 WARN  parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2007-02-12 00:55:25,225 DEBUG parse.html - http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html: falling back to windows-1252
2007-02-12 00:55:25,225 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,255 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_warpcnn/~3/88497144/american-voices-savings-lowest-since.html: falling back to windows-1252
2007-02-12 00:55:25,255 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html: falling back to windows-1252
2007-02-12 00:55:25,277 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_marquee/~3/88516140/anna-nicole-why.html: falling back to windows-1252
2007-02-12 00:55:25,278 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,691 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-12 00:55:26,309 DEBUG parse.html - Meta tags for http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
 * http-equiv tags:

2007-02-12 00:55:26,310 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,315 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,316 DEBUG parse.html - Getting links...
2007-02-12 00:55:26,318 WARN  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2007-02-12 00:55:26,319 DEBUG parse.html - found 1 outlinks in http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html
2007-02-12 00:55:26,321 DEBUG parse.html - Meta tags for http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
 * http-equiv tags:

2007-02-12 00:55:26,321 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,330 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,331 DEBUG parse.html - Getting links...



> Possibly use a different library to parse RSS feed for improved performance 
> and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


