date:20070209

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA


[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471620
 ] 

Dogacan Güney commented on NUTCH-443:
-

This is pretty much the merge of our work (except parse-rss: it kept failing on 
something like RSSContentUtils, so it returns a single parse for now). 

I also had a bug in MapWritable; this patch fixes it.

Since the code now compiles :), I ran the JUnit tests over it. TestFetcher fails 
for some reason; I will look into it.

Also, there is a bug in updatedb. If getParse returns keys different than 
content.getUrl and if these keys do not have entries in crawl_fetch, 
CrawlDbReducer will ignore those (assuming [correctly] that they are not 
fetched and there is no point in processing them). I will look into this too.
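
The updatedb problem can be pictured as a set difference. The sketch below is a simplified model, not Nutch code: `MissingFetchEntrySketch` and `keysWithoutFetchEntry` are hypothetical names, and plain collections stand in for the parse output and for crawl_fetch.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class MissingFetchEntrySketch {
    // Returns the parse keys that have no matching fetch entry, in input order.
    // In the scenario described above, these are the URLs that would be skipped.
    static Set<String> keysWithoutFetchEntry(List<String> parseKeys, Set<String> fetchedUrls) {
        Set<String> missing = new LinkedHashSet<>(parseKeys);
        missing.removeAll(fetchedUrls);
        return missing;
    }
}
```

For a feed fetched at one URL whose parse yields three item URLs, only the feed URL has a fetch entry, so all three item URLs end up in the returned set.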

 allow parsers to return multiple Parse object, this will speed up the rss 
 parser
 

 Key: NUTCH-443
 URL: https://issues.apache.org/jira/browse/NUTCH-443
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
 Fix For: 0.9.0

 Attachments: NUTCH-443-draft-v1.patch, parse-map-core-draft-v1.patch, 
 parse-map-core-untested.patch, parsers.diff


 Allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
 can return multiple parse objects, which will all be indexed separately. 
 Advantage: no need to fetch all feed items separately.
 See the discussion at 
 http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
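
A minimal sketch of what such a signature could look like. The types here (`SketchParse`, `SketchParser`, `FeedParserSketch`) are simplified, hypothetical stand-ins and not the real Nutch classes; adding an entry for the feed URL itself is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for a parse result; NOT the real Nutch Parse interface.
final class SketchParse {
    final String title;
    final String text;

    SketchParse(String title, String text) {
        this.title = title;
        this.text = text;
    }
}

// Hypothetical parser contract: one parse per URL instead of a single parse.
interface SketchParser {
    Map<String, SketchParse> parse(String url, String[][] items);
}

// Hypothetical feed parser; each item is given as {link, title, description}.
final class FeedParserSketch implements SketchParser {
    @Override
    public Map<String, SketchParse> parse(String feedUrl, String[][] items) {
        Map<String, SketchParse> parseMap = new LinkedHashMap<>();
        // Entry for the feed itself, keyed by the URL that was actually fetched.
        parseMap.put(feedUrl, new SketchParse("feed", items.length + " items"));
        // One entry per item, keyed by the item's own link, so that each
        // item can be indexed as a separate document.
        for (String[] item : items) {
            parseMap.put(item[0], new SketchParse(item[1], item[2]));
        }
        return parseMap;
    }
}
```

A caller would then iterate over the map entries instead of handling a single parse, which is why downstream code such as the fetcher and updatedb has to cope with keys other than the fetched URL.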

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dogacan Güney updated NUTCH-443:


Attachment: NUTCH-443-draft-v1.patch


[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dogacan Güney updated NUTCH-443:


Attachment: NUTCH-443-draft-v2.patch

Small update to the patch. Now all core JUnit tests pass.

Now, a question: when posting patches to JIRA, should I attach a new
patch each time I find and fix a bug (as I do now), or should I wait until 
a new patch contains a couple of fixes?


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread nutch.newbie (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471703
 ] 

nutch.newbie commented on NUTCH-443:


I tried the patch with about 100 RSS feeds. Some problems:

1. The atom+xml content type gives trouble. I am not sure whether commons-feedparser 
supports Atom 1.0.
2. In my case the RSS URL sometimes doesn't end with .xml or .rss, so some of the 
feeds got indexed the way the current Nutch trunk does it, i.e. as HTML.
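
The second problem suggests deciding by content type first and by URL suffix only as a fallback. A rough sketch of that idea follows; `FeedTypeSketch` and `looksLikeFeed` are hypothetical names, the list of media types is an assumption, and this is not how Nutch's own MIME detection is implemented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class FeedTypeSketch {
    // Media types treated as feeds; this list is an assumption.
    private static final Set<String> FEED_TYPES = new HashSet<>(Arrays.asList(
            "application/rss+xml", "application/atom+xml"));

    static boolean looksLikeFeed(String contentType, String url) {
        if (contentType != null) {
            // Drop parameters such as "; charset=utf-8" before comparing.
            String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            if (FEED_TYPES.contains(base)) {
                return true;
            }
        }
        // Fall back on the URL suffix when the header is missing or unhelpful.
        String lower = url.toLowerCase(Locale.ROOT);
        return lower.endsWith(".rss") || lower.endsWith(".xml");
    }
}
```

With this check, a feed served as application/atom+xml from a URL without a .xml or .rss suffix would still be routed to the feed parser.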

Just some early feedback. I will do some more testing this weekend. One 
question I do have: this still doesn't solve the problem of indexing just 
the RSS feeds. Even if I take away all my other parsers, I still need the HTML 
parser and index-basic. Maybe it's time for index-rss?

Cheers


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread nutch.newbie (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471743
 ] 

nutch.newbie commented on NUTCH-443:


After some quick research, it seems that feedparser doesn't support Atom 1.0. The 
comment below is not related to the API changes but rather to feedparser, which seems 
to be a dead end. Maybe it's time to seriously consider Rome 
(https://rome.dev.java.net/); it is being actively developed and has an Apache-style 
license. What do others think about the change?

Regards


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread Gal Nitzan (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471747
 ] 

Gal Nitzan commented on NUTCH-443:
--

Actually, I tested Rome after feedparser failed with an OutOfMemory error. Rome has 
the same problem as feedparser: both convert the feed to JDOM first :(. I had 
to write my own RSS parser implementation with StAX.

Neither Rome nor feedparser could handle a feed with 100K items, which is 
probably not the common use case, but it is not that far-fetched either.
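
For readers unfamiliar with the streaming approach, here is a small sketch using the standard javax.xml.stream API. It is not the implementation mentioned above, which is not shown in this thread; `StaxRssSketch` and `itemTitles` are hypothetical names. The point is that the reader walks the feed event by event, so no document tree for the whole feed is built.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxRssSketch {
    // Collects the <title> of every <item>, reading the feed as a stream of events.
    static List<String> itemTitles(String rssXml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(rssXml));
            boolean inItem = false;
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if ("item".equals(name)) {
                            inItem = true;
                        } else if (inItem && "title".equals(name)) {
                            // Reads the text content and moves to the closing tag.
                            titles.add(reader.getElementText());
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && "item".equals(reader.getLocalName())) {
                        inItem = false;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so that callers of this sketch need no checked exception.
            throw new IllegalStateException("malformed feed", e);
        }
        return titles;
    }
}
```

A real parser would hand each item to the indexer as soon as its closing tag is seen rather than collecting results in a list, and would read from an InputStream instead of a String.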

HTH

Gal.


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread nutch.newbie (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471754
 ] 

nutch.newbie commented on NUTCH-443:


Gal:

Thanks for the feedback and the tests you have done. If Nutch is going to be 
an open source version of Google, then maybe we should consider StAX. Could you 
please provide some info about your implementation, probably on the 
mailing list? My use case is going to be a lot more than a 100K-item feed, 
so I am interested to know more. I would also like to hear others' views on 
feedparser, please, apart from the Apache politics :-) The big question is: can 
anyone use Nutch to build a Technorati or Bloglines using feedparser? It seems not.


[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread Chris A. Mattmann (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471780
 ] 

Chris A. Mattmann commented on NUTCH-443:
-

Nutch Newbie,

   What exactly do you mean when you mention Apache politics? Feedparser wasn't 
selected because it was an Apache sub-project. In fact, that's as far from the 
truth as possible. I selected feedparser at the time (in May 2005 or so), 
because it was the only one of the three RSS reading APIs (informa, feedparser 
and rome) that I could figure out. The time that it took me to just understand 
rome, and informa was far greater than the time that it took me to write the 
entire RSS parser using feedparser.

   That said, things may have changed in the past year and a half. Perhaps Rome 
provides an easier API than feedparser now. Perhaps informa is faster. I'm not 
exactly sure what the answer to these and other questions on this subject are. 
However, before anything is said about feedparser, it's only fair that the 
folks who wrote it get to chime in. For that matter, it would probably be a 
good idea to contact Kevin Burton, the lead developer of the 
commons-feedparser, and ask him about its relationship to rome, and other apis 
such as Stax, or informa even...

Cheers,
  Chris



[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread nutch.newbie (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471806
 ] 

nutch.newbie commented on NUTCH-443:


Chris:

Frankly, my comments are about feedparser, and I must say I am grateful for 
the rss-plugin and the hard work you put in. You decided to go with 
feedparser because you thought it was the correct solution, so please don't take 
this personally. 

According to SVN

http://svn.apache.org/viewvc/jakarta/commons/dormant/feedparser/trunk/ the last 
update to feedparser was 12 months ago, plus there is no Atom 
1.0 support. This is how I like to put it (and frankly it doesn't matter):

1. The goal of Nutch is to be an open source alternative to Google.
2. You can't have a dead-end feedparser as your fundamental feed parsing 
solution when the project has not moved for the last 12 months! Go 
figure why people think it's Apache politics.

Sorry for bursting out like this. On the one hand Nutch would like to preach that 
it is the alternative to Google, and on the other hand it uses technology that is 
no longer active. That's all. 




[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dogacan Güney updated NUTCH-443:


Attachment: NUTCH-443-draft-v3.patch

New patch; it contains a possible fix for the CrawlDbReducer problem.

This version finally works! (well, not really, but I can definitely say that it 
almost kind of works..sometimes:)

I have two main issues with this patch:

1) If the fetcher is in parsing mode and parse returns a SUCCESS_REDIRECT, the
fetcher handles this redirect. After this change, the fetcher checks whether the first 
element of parseMap.values() (whatever that may be) has a SUCCESS_REDIRECT. It 
is possible that a multi-entry parseMap has a parse element with a 
SUCCESS_REDIRECT that is not the first element. (Perhaps we can first check whether 
parseMap.get(originalUrl) returns a parse and, if not, use the first element of 
parseMap.values()?)

2) To be able to pass the fetch time to not-actually-fetched-but-generated-in-parse 
URLs, I first put the original fetch time into content and then pass the value in 
content to all elements in parseMap.values(). I guess this approach is not 
optimal, since it passes the fetch time around a lot.
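
The fallback suggested in point 1 could be sketched roughly as follows. This is only an illustration under assumptions: `RedirectCheckSketch`, `statusToCheck` and the `Status` enum are hypothetical stand-ins, and the map holds statuses directly instead of real parse objects.

```java
import java.util.Map;

final class RedirectCheckSketch {
    // Stand-in for parse status codes; only the distinction matters here.
    enum Status { SUCCESS, SUCCESS_REDIRECT }

    // Prefers the entry for the originally fetched URL; if there is none,
    // falls back to the first value in the map. Returns null for an empty map.
    static Status statusToCheck(Map<String, Status> parseMap, String originalUrl) {
        Status own = parseMap.get(originalUrl);
        if (own != null) {
            return own;
        }
        return parseMap.isEmpty() ? null : parseMap.values().iterator().next();
    }
}
```

Note that "first value" is only well defined for an ordered map such as LinkedHashMap; with a HashMap the fallback would pick an arbitrary entry, which is the "whatever that may be" concern raised above.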



[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA


[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471857
 ] 

Dogacan Güney commented on NUTCH-443:
-

nutch.newbie:

I fail to see what the problem is. If feedparser doesn't work for you, Nutch 
has a very powerful plugin api. Just write another plugin that uses Rome or 
whatever. If you are willing to share it, post it to JIRA explaining why your 
plugin is better than the current one. Unless there is a license-related 
problem, I am sure that nutch developers will put it in.

PS: I actually have a half-baked plugin that uses Rome, and I will work on rss 
index and rss query plugins once this issue is resolved.


[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread Renaud Richardet (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Richardet updated NUTCH-443:
---

Attachment: NUTCH-443-draft-v4.patch

Hi Dogacan,

Thanks for merging the patches, good teamwork!

I worked on the RSS parser; it should now basically work.
All core and plugin tests now pass, except for TestRSSparser; I will work on 
that. Once this is in place, I will have a look at the other issues with fetch 
time, etc.

I merged my changes with your patch, version 3.



[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread Renaud Richardet (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471878
 ] 

Renaud Richardet commented on NUTCH-443:


Nutch Newbie, Gal, Chris

It's great that you are discussing alternative RSS parsing libraries, but the 
resolution of this issue does not depend on which underlying RSS library is used 
in RSSParser. Would you mind moving the conversation to the new issue I created 
for it (NUTCH-444)? Thanks a bunch.




[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-09 Thread nutch.newbie (JIRA)

[
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471952
]

nutch.newbie commented on NUTCH-444:

Renaud:

Thanks for moving the discussion here. First, to answer your question: yes, it is
a MIME type detection problem. The goal of the trial was to see if one
could build a feed-only search site, but I didn't succeed. I will
give it another go over the weekend.

Dogacan:

Yes, one could just replace feedparser with Rome or StAX and submit it back
here or use it internally. My point was to find out how others see
it, and whether others have run into problems and what their experience was.
Gal pointed out the situation with Rome (at least it is being further developed)
and StAX, and you pointed out that you are doing something with Rome. I just
wanted to know what others think and what their experience is, that's all. Yes,
you are correct, I posted it in the wrong place, NUTCH-443. But NUTCH-443 started
off as someone having trouble with RSS, and in my view it is important to discuss
the issue, since we are using a library (feedparser) which is not going to solve
the original problem if one tries to create an RSS-only search engine. Otherwise
NUTCH-443 would not have surfaced in the first place.

I am looking forward to the day when I can use Nutch as an RSS feed
search engine, so, Dogacan, I am very interested in your Rome implementation.
Maybe you can post the code here so that I can participate.

Possibly use a different library to parse RSS feed for improved performance
and compatibility
-

Key: NUTCH-444
URL: https://issues.apache.org/jira/browse/NUTCH-444
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Fix For: 0.9.0

As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current
library (feedparser) has the following issues:
- OutOfMemory when parsing feeds with 100k items, since it has to convert the
feed to JDOM first
- no support for Atom 1.0
- there has been no development in the last year
Alternatives are:
- Rome
- Informa
- custom implementation based on Stax
- ??

