[
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472163
]
nutch.newbie commented on NUTCH-444:
------------------------------------
Hi:
I have now done my initial test run with 10,000+ feeds in 3 batches.
Batch 1
======
A total of 8000 feeds with URLs ending in ".rss", RSS feeds only. These work
out of the box.
Batch 2
======
A total of 3000 Atom feeds ending with ".xml". Most of the time these throw an
error during the dedup process. Sometimes they get parsed by parse-html.
Batch 3
======
A total of 2000 feeds ending with all kinds of extensions, for example .aspx,
.php, .jsp, .ece and what not. These also throw errors, just like batch 2.
Batch 2 and Batch 3 produce the same bug as before. Note that I have run only
1 round of fetch. One thing I am a bit confused about is the following. Let's
say you have a feed with 5 items, i.e. 5 titles and 5 descriptions. Shouldn't a
search such as url:feed.com return 6 results: 1 for the main feed page, which
is the actual feed URL, and the other 5 for the 5 items? Currently I get only 1
search result, which is the feed URL.
Do I need to do 2 rounds of fetch? I ask because things are getting parsed
correctly. Maybe it's because I don't have the indexing plugin, i.e.
index-feed? I know we will work on it after NUTCH-443 is done, but I want to
get a clarification, that's all :-) Cheers!
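To make the count I expect concrete, here is a minimal sketch. It is not Nutch
code and not what the parse-feed plugin does. It only streams through a feed
with StAX (one of the alternatives listed in the issue), counts RSS item and
Atom entry elements, and adds 1 for the feed URL itself. The class and method
names (FeedHitCount, countItems, expectedHits) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, for illustration only; not part of Nutch.
class FeedHitCount {

    // Counts RSS <item> and Atom <entry> elements by streaming over the XML,
    // so no in-memory document tree is built.
    static int countItems(String feedXml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader =
                factory.createXMLStreamReader(new StringReader(feedXml));
        int items = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals("item") || name.equals("entry")) {
                        items++;
                    }
                }
            }
        } finally {
            reader.close();
        }
        return items;
    }

    // Assumption from the question above: one hit per item plus one hit
    // for the feed URL itself.
    static int expectedHits(String feedXml) throws XMLStreamException {
        return countItems(feedXml) + 1;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Build a small RSS feed with 5 items.
        StringBuilder rss = new StringBuilder(
                "<rss version=\"2.0\"><channel><title>Example</title>");
        for (int i = 1; i <= 5; i++) {
            rss.append("<item><title>Item ").append(i)
               .append("</title><description>Desc</description></item>");
        }
        rss.append("</channel></rss>");
        System.out.println(expectedHits(rss.toString())); // prints 6
    }
}
```

So for the 5-item feed I would expect 6 hits, whereas the index currently
returns 1.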
Some log trace from Batch 1
===================
2007-02-12 00:55:23,607 DEBUG parse.ParseUtil - Parsing
[http://rss.cnn.com/rss/cnn_marquee.rss] with [EMAIL PROTECTED]
2007-02-12 00:55:23,648 INFO mapred.JobClient - map 100% reduce 0%
2007-02-12 00:55:24,690 INFO mapred.LocalJobRunner - 0 pages, 0 errors, 0.0
pages/s, 0 kb/s,
2007-02-12 00:55:25,020 WARN parse.ParserFactory - ParserFactory:Plugin:
org.apache.nutch.parse.html.HtmlParser mapped to contentType
application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not
claim to support contentType: application/xhtml+xml
2007-02-12 00:55:25,225 DEBUG parse.html -
http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html:
falling back to windows-1252
2007-02-12 00:55:25,225 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,255 DEBUG parse.html -
http://rss.cnn.com/~r/rss/cnn_warpcnn/~3/88497144/american-voices-savings-lowest-since.html:
falling back to windows-1252
2007-02-12 00:55:25,255 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html -
http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html:
falling back to windows-1252
2007-02-12 00:55:25,277 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html -
http://rss.cnn.com/~r/rss/cnn_marquee/~3/88516140/anna-nicole-why.html: falling
back to windows-1252
2007-02-12 00:55:25,278 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,691 INFO mapred.LocalJobRunner - 0 pages, 0 errors, 0.0
pages/s, 0 kb/s,
2007-02-12 00:55:26,309 DEBUG parse.html - Meta tags for
http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html:
base=null, noCache=false, noFollow=false, noIndex=false, refresh=false,
refreshHref=null
* general tags:
* http-equiv tags:
2007-02-12 00:55:26,310 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,315 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,316 DEBUG parse.html - Getting links...
2007-02-12 00:55:26,318 WARN regex.RegexURLNormalizer - can't find rules for
scope 'outlink', using default
2007-02-12 00:55:26,319 DEBUG parse.html - found 1 outlinks in
http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html
2007-02-12 00:55:26,321 DEBUG parse.html - Meta tags for
http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html:
base=null, noCache=false, noFollow=false, noIndex=false, refresh=false,
refreshHref=null
* general tags:
* http-equiv tags:
2007-02-12 00:55:26,321 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,330 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,331 DEBUG parse.html - Getting links...
> Possibly use a different library to parse RSS feed for improved performance
> and compatibility
> ---------------------------------------------------------------------------------------------
>
> Key: NUTCH-444
> URL: https://issues.apache.org/jira/browse/NUTCH-444
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers