[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472078 ]

Otis Gospodnetic commented on NUTCH-444:

The ASF FeedParser you are talking about has, I believe, continued its life under Kevin Burton in TailRank: http://tailrank.com/code.php
Atom 1.0 and everything else supported.

Possibly use a different library to parse RSS feed for improved performance and compatibility

Key: NUTCH-444
URL: https://issues.apache.org/jira/browse/NUTCH-444
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Fix For: 0.9.0
Attachments: parse-feed.tar.bz2

As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
- OutOfMemory when parsing 100k feeds, since it has to convert the feed to jdom first
- no support for Atom 1.0
- there has been no development in the last year

Alternatives are:
- Rome
- Informa
- custom implementation based on Stax
- ??

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
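To illustrate why a streaming (StAX-style) parser avoids the memory problem described in the issue, here is a minimal sketch. It is not Nutch code and not the parse-feed plugin: it uses Python's `xml.etree.ElementTree.iterparse` as a stand-in for a pull parser, and the function name `iter_feed_items` and the sample feed are assumptions made up for the example. The point is only that each item is handled and released as soon as it has been read, instead of building the whole document tree first.

```python
import io
import xml.etree.ElementTree as ET


def iter_feed_items(stream):
    """Yield (title, link) for each RSS <item> as soon as it is fully read.

    Hypothetical helper for illustration; assumes a plain RSS 2.0 feed
    without XML namespaces on the item elements.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "item":
            yield elem.findtext("title"), elem.findtext("link")
            # Release the item's children once the consumer asks for the next
            # item. The emptied element stays attached to its parent, so
            # memory use is much reduced but not strictly constant.
            elem.clear()


# Made-up sample feed, only used to exercise the sketch.
SAMPLE_RSS = b"""<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First</title><link>http://example.com/1</link></item>
  <item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>"""

items = list(iter_feed_items(io.BytesIO(SAMPLE_RSS)))
print(items)
```

A DOM-style parser (such as the jdom conversion mentioned in the issue) would hold every item in memory at once; the generator above only ever needs the current one.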
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-443:

Attachment: NUTCH-443-draft-v5.patch

New version. Indexing now works as well, but with a catch. Many ScoringFilter functions take both a dbDatum and a fetchDatum. After this change, fetchDatum may be null, because a URL may have been generated during parsing rather than fetched. This does not affect scoring-opic, though.

allow parsers to return multiple Parse object, this will speed up the rss parser

Key: NUTCH-443
URL: https://issues.apache.org/jira/browse/NUTCH-443
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Assigned To: Chris A. Mattmann
Priority: Minor
Fix For: 0.9.0
Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff

Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, which will all be indexed separately. Advantage: no need to fetch all feed items separately.

See the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-443:

Attachment: NUTCH-443-draft-v6.patch

Oops... I forgot to merge Renaud Richardet's work. This is the same as v5, except that it includes Renaud Richardet's changes from v4. I am really, really sorry about this. I will be more careful next time.
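The idea behind NUTCH-443, one parse call returning several parse objects keyed by URL, can be sketched as follows. This is a conceptual illustration in Python, not the patch itself: the `Parse` dataclass, the `parse_feed` function, and the sample feed are all hypothetical stand-ins, and the real Nutch `Parse` interface carries far more than a title and text.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Parse:
    """Hypothetical stand-in for a parse result; not Nutch's Parse interface."""
    title: str
    text: str


def parse_feed(feed_url, content):
    """Return one entry for the feed itself plus one entry per item link.

    Assumes a plain RSS 2.0 document; items without a <link> are skipped.
    """
    channel = ET.fromstring(content).find("channel")
    parses = {feed_url: Parse(channel.findtext("title", ""), "")}
    for item in channel.findall("item"):
        link = item.findtext("link")
        if link:
            parses[link] = Parse(
                item.findtext("title", ""),
                item.findtext("description", ""),
            )
    return parses


# Made-up sample feed with two items.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>A</title><link>http://example.com/a</link><description>alpha</description></item>
  <item><title>B</title><link>http://example.com/b</link><description>beta</description></item>
</channel></rss>"""

parses = parse_feed("http://example.com/feed.rss", SAMPLE_FEED)
for url, parse in parses.items():
    print(url, "->", parse.title)
```

With two items the map holds three entries, one for the feed URL and one per item, so every item can be indexed without a separate fetch.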
[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dogacan Güney updated NUTCH-444:

Attachment: parse-feed-v2.tar.bz2

Updated parse-feed plugin. Still not ready for any serious use, but I think I fixed the problems with indexing and dedup. Use it with NUTCH-443's v5 patch.

nutch.newbie: I changed parse-plugins.xml the same way you did. For this plugin to work, you also have to change the default signature to TextProfileSignature (because MD5Signature takes the hash of the content, which is the same for every element in a parseMap). This is done by adding:

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>

to your nutch-site.xml.
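The reason given above for switching signatures can be shown with a small sketch. It is an illustration under stated assumptions, not Nutch code: `content_signature` mimics the idea of hashing the raw fetched content, while `text_signature` is a deliberately simplified stand-in that hashes the parsed text. The real TextProfileSignature is more elaborate; only the collision argument is being demonstrated.

```python
import hashlib


def content_signature(raw_content: bytes) -> str:
    """Hash of the raw fetched bytes (the idea behind a content-based signature)."""
    return hashlib.md5(raw_content).hexdigest()


def text_signature(parse_text: str) -> str:
    """Simplified stand-in: hash of the parsed text of a single entry."""
    return hashlib.md5(parse_text.encode("utf-8")).hexdigest()


# Every entry produced from one feed shares the same raw content (the feed),
# but each has its own parsed text. Both values below are made up.
feed_bytes = b"<rss>...the whole feed document...</rss>"
item_texts = {
    "http://example.com/a": "alpha story text",
    "http://example.com/b": "beta story text",
}

content_sigs = {url: content_signature(feed_bytes) for url in item_texts}
text_sigs = {url: text_signature(text) for url, text in item_texts.items()}

# One distinct content signature means dedup would see the items as duplicates.
print(len(set(content_sigs.values())))
# Two distinct text signatures mean the items survive dedup.
print(len(set(text_sigs.values())))
```

Because deduplication compares signatures, a content-based hash would collapse all items of a feed into one document, while a text-based one keeps them apart.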
[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472121 ]

nutch.newbie commented on NUTCH-444:

A big thank you! It works with the latest patch etc. All the previously reported bugs are gone now :-)

About my test tonight: I just want to run it on a decent set of URLs to collect more bugs, nothing more :-)

Cheers
Re: api.RegexURLFilterBase - Configuration Resources
Again, thank you for your help. In the end, I had slightly wrong configs for my plugin, but now it seems to work. However, since Nutch no longer prints any output on the command line, I can't find out whether everything is correct in the end (readdb -stats). I don't know why it is that way; I haven't changed anything. It would be great if someone had an idea what to do now! My Nutch version is 0.8.

Best regards,
Tobias Zahn
[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472163 ]

nutch.newbie commented on NUTCH-444:

Hi: I have now done my initial test run with 10 000 + feeds in 3 batches.

Batch 1 == A total of 8000 feeds with URLs ending in .rss, RSS feeds only. Works out of the box.
Batch 2 == A total of 3000 Atom feeds ending with .xml. Most of the time this throws an error during the dedup process. Sometimes a feed gets parsed by parse-html.
Batch 3 == A total of 2000 feeds ending with all kinds of extensions, for example .aspx, .php, .jsp, .ece and what not. This also throws an error, just like batch 2.

Batch 2 and Batch 3 show the same bug as before. Note that I have run only 1 round of fetch.

One thing that I am a bit confused about is the following. Let's say you have a feed with 5 items, i.e. 5 titles and 5 descriptions. Shouldn't a search such as url:feed.com return 6 results, 1 for the main feed page, which is the actual feed URL, and the other 5 for the 5 items? Currently I get only 1 search result, which is the feed URL. Do I need to do 2 rounds of fetch? Things are getting parsed correctly, so maybe it's because I don't have the indexing plugin, i.e. index-feed? I know we will work on it after NUTCH-443 is done, but I want to get a clarification, that's all :-)

Cheers!
Some log trace from Batch 1 ===

2007-02-12 00:55:23,607 DEBUG parse.ParseUtil - Parsing [http://rss.cnn.com/rss/cnn_marquee.rss] with [EMAIL PROTECTED]
2007-02-12 00:55:23,648 INFO mapred.JobClient - map 100% reduce 0%
2007-02-12 00:55:24,690 INFO mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-12 00:55:25,020 WARN parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2007-02-12 00:55:25,225 DEBUG parse.html - http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html: falling back to windows-1252
2007-02-12 00:55:25,225 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,255 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_warpcnn/~3/88497144/american-voices-savings-lowest-since.html: falling back to windows-1252
2007-02-12 00:55:25,255 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html: falling back to windows-1252
2007-02-12 00:55:25,277 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_marquee/~3/88516140/anna-nicole-why.html: falling back to windows-1252
2007-02-12 00:55:25,278 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,691 INFO mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-12 00:55:26,309 DEBUG parse.html - Meta tags for http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
  * general tags:
  * http-equiv tags:
2007-02-12 00:55:26,310 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,315 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,316 DEBUG parse.html - Getting links...
2007-02-12 00:55:26,318 WARN regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2007-02-12 00:55:26,319 DEBUG parse.html - found 1 outlinks in http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html
2007-02-12 00:55:26,321 DEBUG parse.html - Meta tags for http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
  * general tags:
  * http-equiv tags:
2007-02-12 00:55:26,321 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,330 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,331 DEBUG parse.html - Getting links...

Possibly use a different library to parse RSS feed for improved performance and compatibility

Key: NUTCH-444
URL: https://issues.apache.org/jira/browse/NUTCH-444
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Fix For: 0.9.0
Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2

As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
- OutOfMemory when parsing 100k feeds, since it has to convert the feed to jdom first
- no support for Atom 1.0
- there has been no development in the