[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-11 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472078
 ] 

Otis Gospodnetic commented on NUTCH-444:


The ASF FeedParser you are talking about has, I believe, continued its life 
under Kevin Burton at TailRank:  http://tailrank.com/code.php
It supports Atom 1.0 and everything else.


 Possibly use a different library to parse RSS feed for improved performance 
 and compatibility
 -

 Key: NUTCH-444
 URL: https://issues.apache.org/jira/browse/NUTCH-444
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Fix For: 0.9.0

 Attachments: parse-feed.tar.bz2


 As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
 library (feedparser) has the following issues:
 - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
 JDOM first
 - no support for Atom 1.0
 - there has been no development in the last year
 Alternatives are:
 - Rome 
 - Informa
 - custom implementation based on StAX
 - ??
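
 As a rough illustration of the StAX option listed above, the sketch below
 pulls item titles out of an RSS document with the JDK's streaming
 javax.xml.stream API, so no JDOM tree is built in memory. The class and
 method names (StaxFeedTitles, itemTitles) are hypothetical and are not part
 of Nutch or of any attachment on this issue. Namespaces, Atom, and most
 error handling are left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not Nutch code: streams through an RSS document and
// collects the <title> of each <item> without building an in-memory tree.
final class StaxFeedTitles {

    static List<String> itemTitles(String rssXml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Feeds come from untrusted hosts, so turn off DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new StringReader(rssXml));
            try {
                boolean inItem = false;
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if ("item".equals(name)) {
                            inItem = true;
                        } else if (inItem && "title".equals(name)) {
                            // getElementText() reads up to the matching end tag.
                            titles.add(reader.getElementText().trim());
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && "item".equals(reader.getLocalName())) {
                        inItem = false;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Unchecked wrapper keeps the sketch easy to call.
            throw new IllegalArgumentException("Malformed feed XML", e);
        }
        return titles;
    }
}
```

 Because the reader only ever holds the current event, memory use should
 depend on the size of a single element rather than on the size of the feed.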

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-11 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dogacan Güney updated NUTCH-443:


Attachment: NUTCH-443-draft-v5.patch

New version. Indexing now works too, but there is a catch: many ScoringFilter 
functions take both a dbDatum and a fetchDatum, and after this change 
fetchDatum may be null, because a URL may have been generated during parsing 
rather than fetched. This does not affect scoring-opic, though.
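
One way a scoring function could cope with a null fetchDatum is sketched 
below. Everything here is a hypothetical stand-in: Datum plays the role of a 
crawl datum, and indexBoost is not the real ScoringFilter signature. Falling 
back to dbDatum is an assumption made for illustration, not what the patch 
does.

```java
// Hypothetical stand-ins, not the Nutch API: Datum plays the role of a crawl
// datum, and indexBoost mimics a scoring function that receives both a
// dbDatum and a fetchDatum.
final class NullSafeScoring {

    static final class Datum {
        final float score;

        Datum(float score) {
            this.score = score;
        }
    }

    // fetchDatum may be null when the URL was generated during parsing
    // (for example, a feed item) rather than fetched. Assumption: fall back
    // to dbDatum in that case.
    static float indexBoost(Datum dbDatum, Datum fetchDatum, float initialBoost) {
        Datum source = (fetchDatum != null) ? fetchDatum : dbDatum;
        if (source == null) {
            return initialBoost; // nothing is known about this URL yet
        }
        return initialBoost * source.score;
    }
}
```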

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Chris A. Mattmann
Priority: Minor
 Fix For: 0.9.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
 parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff


 Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
 can return multiple Parse objects, which will all be indexed separately. 
 Advantage: no need to fetch all feed items separately.
 see the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
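
 The idea can be sketched roughly as follows. The Parse and Item classes
 here are minimal hypothetical placeholders, not the Nutch classes, and
 parseFeed is not the method from any of the attached patches. It only shows
 the shape of the result: one entry for the feed itself plus one entry per
 item, keyed by URL.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins, not the Nutch API or the attached patches.
final class MultiParseSketch {

    // Minimal placeholder for a parse result: a title plus extracted text.
    static final class Parse {
        final String title;
        final String text;

        Parse(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    // One feed entry: its link, title, and description.
    static final class Item {
        final String link;
        final String title;
        final String description;

        Item(String link, String title, String description) {
            this.link = link;
            this.title = title;
            this.description = description;
        }
    }

    // Returns one Parse for the feed itself plus one per item, keyed by URL,
    // so each entry can be indexed separately without another fetch.
    static Map<String, Parse> parseFeed(String feedUrl, String feedTitle,
                                        List<Item> items) {
        Map<String, Parse> parses = new LinkedHashMap<>();
        parses.put(feedUrl, new Parse(feedTitle, ""));
        for (Item item : items) {
            parses.put(item.link, new Parse(item.title, item.description));
        }
        return parses;
    }
}
```

 With this shape, a feed with 5 items yields a map of 6 entries, provided
 every item link is distinct from the others and from the feed URL.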




[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-11 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dogacan Güney updated NUTCH-443:


Attachment: NUTCH-443-draft-v6.patch

Oops... I forgot to merge Renaud Richardet's work.

This is the same as v5 except it includes Renaud Richardet's changes from v4.

I am really really sorry about this. Will be more careful next time.

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
 Assigned To: Chris A. Mattmann
Priority: Minor
 Fix For: 0.9.0

 Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
 NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
 NUTCH-443-draft-v6.patch, parse-map-core-draft-v1.patch, 
 parse-map-core-untested.patch, parsers.diff





[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-11 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dogacan Güney updated NUTCH-444:


Attachment: parse-feed-v2.tar.bz2

Updated parse-feed plugin. Still not ready for any serious use, but I think I 
fixed the problems with indexing and dedup. Use it with NUTCH-443's v5 patch.

nutch.newbie: I changed parse-plugins.xml the same way you did. For this 
plugin to work, you also have to change the default signature to 
TextProfileSignature (because MD5Signature takes the hash of the content, 
which is the same for every element in a parseMap). This is done by adding:

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>

to your nutch-site.xml.
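
To see why a content hash collides here, consider the sketch below. It is a 
simplified illustration, not the Nutch implementation: SignatureSketch and 
its methods are hypothetical names, and textSignature simply hashes the 
parsed text, whereas the real TextProfileSignature builds a profile of the 
text before hashing.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration, not Nutch code.
final class SignatureSketch {

    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Content-based signature: hashes the raw fetched content. Every entry
    // produced from one feed shares that content, so all entries collide.
    static String contentSignature(String rawContent) {
        return md5Hex(rawContent);
    }

    // Text-based signature (simplified stand-in): hashes the parsed text of
    // a single entry, so entries with different text get different values.
    static String textSignature(String parsedText) {
        return md5Hex(parsedText);
    }
}
```

Two items from the same feed get the same contentSignature, because both 
are hashed over the whole feed document, so dedup would keep only one of 
them. Their textSignature values differ as long as their parsed text 
differs.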


 Possibly use a different library to parse RSS feed for improved performance 
 and compatibility
 -

 Key: NUTCH-444
 URL: https://issues.apache.org/jira/browse/NUTCH-444
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Fix For: 0.9.0

 Attachments: parse-feed-v2.tar.bz2, parse-feed.tar.bz2





[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-11 Thread nutch.newbie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472121
 ] 

nutch.newbie commented on NUTCH-444:


A big thank you! It works with the latest patch. All the previously reported 
bugs are gone now :-) About my test tonight: I just want to run it on a decent 
set of URLs to collect more bugs, nothing more :-) 

Cheers





Re: api.RegexURLFilterBase - Configuration Resources

2007-02-11 Thread Tobias Zahn
Again, thank you for your help.
In the end, I had slightly wrong configs for my plugin, but now it seems
to work. But since Nutch no longer prints any output on the command line, I
can't check whether everything is correct in the end (readdb -stats).

I don't know why it is that way - I haven't changed anything.
It would be great if someone had an idea what to do now!

My Nutch version is 0.8.


Best regards,
Tobias Zahn


[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-11 Thread nutch.newbie (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472163
 ] 

nutch.newbie commented on NUTCH-444:


Hi: 

I have now done my initial test run with 10,000+ feeds in 3 batches. 

Batch 1
=======
A total of 8000 feeds with URLs ending in .rss, RSS feeds only. Works out of 
the box.

Batch 2
=======
A total of 3000 Atom feeds with URLs ending in .xml. Most of the time these 
throw an error during the dedup process. Sometimes they get parsed by 
parse-html. 

Batch 3
=======
A total of 2000 feeds ending with all kinds of extensions, for example .aspx, 
.php, .jsp, .ece and what not. These also throw errors, just like batch 2.

Batch 2 and Batch 3 produce the same bug as before. Note that I have run only 
1 round of fetch. One thing I am a bit confused about is the following. Let's 
say you have a feed with 5 items, i.e. 5 titles and 5 descriptions. Shouldn't 
a search such as url:feed.com return 6 results: 1 for the main feed page, which 
is the actual feed URL, and the other 5 for the 5 items? Currently I get only 1 
search result, which is the feed URL.
Do I need to do 2 rounds of fetch? Things are getting parsed correctly, so 
maybe it's because I don't have the indexing plugin, i.e. index-feed? I know 
we will work on it after NUTCH-443 is done, but I want to get a 
clarification, that's all :-) Cheers!


Some log trace from Batch 1
===
2007-02-12 00:55:23,607 DEBUG parse.ParseUtil - Parsing [http://rss.cnn.com/rss/cnn_marquee.rss] with [EMAIL PROTECTED]
2007-02-12 00:55:23,648 INFO  mapred.JobClient -  map 100% reduce 0%
2007-02-12 00:55:24,690 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-12 00:55:25,020 WARN  parse.ParserFactory - ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser mapped to contentType application/xhtml+xml via parse-plugins.xml, but its plugin.xml file does not claim to support contentType: application/xhtml+xml
2007-02-12 00:55:25,225 DEBUG parse.html - http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html: falling back to windows-1252
2007-02-12 00:55:25,225 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,255 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_warpcnn/~3/88497144/american-voices-savings-lowest-since.html: falling back to windows-1252
2007-02-12 00:55:25,255 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html: falling back to windows-1252
2007-02-12 00:55:25,277 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,277 DEBUG parse.html - http://rss.cnn.com/~r/rss/cnn_marquee/~3/88516140/anna-nicole-why.html: falling back to windows-1252
2007-02-12 00:55:25,278 DEBUG parse.html - Parsing...
2007-02-12 00:55:25,691 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-12 00:55:26,309 DEBUG parse.html - Meta tags for http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
 * http-equiv tags:

2007-02-12 00:55:26,310 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,315 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,316 DEBUG parse.html - Getting links...
2007-02-12 00:55:26,318 WARN  regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2007-02-12 00:55:26,319 DEBUG parse.html - found 1 outlinks in http://www.cnn.com/HEALTH/blogs/paging.dr.gupta/2007/02/handling-friends-diagnosis.html
2007-02-12 00:55:26,321 DEBUG parse.html - Meta tags for http://rss.cnn.com/~r/rss/cnn_ac360blog/~3/88245057/new-orleans-parents-fear-losing-kids.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
 * http-equiv tags:

2007-02-12 00:55:26,321 DEBUG parse.html - Getting text...
2007-02-12 00:55:26,330 DEBUG parse.html - Getting title...
2007-02-12 00:55:26,331 DEBUG parse.html - Getting links...


