[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471620 ]
Dogacan Güney commented on NUTCH-443:
This is pretty much the merge of our work (except parse-rss: it kept failing on something like RSSContentUtils, so it returns a single parse for now). I also had a bug in MapWritable; this patch fixes it. Since the code now compiles :), I ran the JUnit tests over it. TestFetcher fails for some reason; I will look into it.
Also, there is a bug in updatedb. If getParse returns keys different from content.getUrl, and these keys do not have entries in crawl_fetch, CrawlDbReducer will ignore them (assuming [correctly] that they are not fetched and there is no point in processing them). I will look into this too.
allow parsers to return multiple Parse object, this will speed up the rss parser
Key: NUTCH-443
URL: https://issues.apache.org/jira/browse/NUTCH-443
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Fix For: 0.9.0
Attachments: NUTCH-443-draft-v1.patch, parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser can return multiple parse objects, that will all be indexed separately. Advantage: no need to fetch all feed-items separately.
see the discussion at http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html
--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
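A minimal sketch of the API shape the issue proposes, in Java. The types `SimpleParse`, `FeedItem`, and `MultiParseSketch` are hypothetical stand-ins invented for illustration; they are not Nutch classes, and the attached patches may look quite different.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parse result; Nutch's real Parse carries more data.
final class SimpleParse {
    final String title;
    final String text;

    SimpleParse(String title, String text) {
        this.title = title;
        this.text = text;
    }
}

// Hypothetical feed item, used only for this illustration.
final class FeedItem {
    final String link;
    final String title;
    final String description;

    FeedItem(String link, String title, String description) {
        this.link = link;
        this.title = title;
        this.description = description;
    }
}

final class MultiParseSketch {
    // Returns one parse for the feed itself plus one per item, keyed by URL,
    // so that each item can be indexed separately without being fetched.
    static Map<String, SimpleParse> parse(String feedUrl, String feedTitle, List<FeedItem> items) {
        Map<String, SimpleParse> parseMap = new LinkedHashMap<>();
        parseMap.put(feedUrl, new SimpleParse(feedTitle, ""));
        for (FeedItem item : items) {
            parseMap.put(item.link, new SimpleParse(item.title, item.description));
        }
        return parseMap;
    }
}
```

In this sketch only the feed URL key corresponds to a page that was actually fetched. The item keys are the kind of entries that, per the comment above, have no counterpart in crawl_fetch and are therefore skipped by CrawlDbReducer.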
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dogacan Güney updated NUTCH-443:
Attachment: NUTCH-443-draft-v1.patch
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dogacan Güney updated NUTCH-443:
Attachment: NUTCH-443-draft-v2.patch
Small update to the patch. Now all core JUnit tests pass.
Now, a question: when posting patches to JIRA, should I attach a new patch each time I find and fix a bug (as I do now), or should I wait until the changes between successive patches include a couple of fixes?
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471703 ]
nutch.newbie commented on NUTCH-443:
I tried the patch with about 100 RSS feeds. Some problems:
1. The atom+xml content type gives trouble. I am not sure if commons feedparser supports Atom 1.0.
2. In my case the RSS URL sometimes doesn't end with .xml or .rss, so some of the feeds got indexed the way current Nutch trunk does it, i.e. as HTML.
Just some early feedback. I will do some more testing this weekend.
One question I do have: this still doesn't solve the problem of indexing just the RSS feeds. Even if I take away all my other parsers, I still need the HTML parser and index-basic. Maybe it's time for index-rss? No?
Cheers
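The second problem above suggests choosing the parser by content type rather than by URL suffix. A minimal sketch of such a check follows; `FeedTypeSketch` and `isFeed` are hypothetical names, this is not how Nutch's own MIME detection is implemented, and the set of types is an assumption limited to the two registered feed media types.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class FeedTypeSketch {
    // Assumed set of feed media types; extend as needed.
    private static final Set<String> FEED_TYPES = new HashSet<>(Arrays.asList(
            "application/rss+xml",
            "application/atom+xml"));

    // Decides from a Content-Type header value whether the document is a feed.
    // Parameters such as "; charset=utf-8" are ignored.
    static boolean isFeed(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false;
        }
        int semicolon = contentTypeHeader.indexOf(';');
        String type = semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader;
        return FEED_TYPES.contains(type.trim().toLowerCase(Locale.ROOT));
    }
}
```

Many servers deliver feeds as text/xml or application/xml, so a header check alone would still miss some feeds; inspecting the root element of the document would be needed for those.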
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471743 ]
nutch.newbie commented on NUTCH-443:
After some quick research, it seems that feedparser doesn't do Atom 1.0. The comment below is not related to the API changes but to feedparser, which seems to be a dead end. Maybe it's time to seriously consider Rome (https://rome.dev.java.net/). It is being actively developed and has an Apache-style license. What do others think about the change?
Regards
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471747 ]
Gal Nitzan commented on NUTCH-443:
Actually, I tested Rome after feedparser failed with OutOfMemory. Rome has the same problem as feedparser: both convert the feed to JDOM first :(. I had to write my own RSS parser implementation with StAX. Neither Rome nor feedparser could handle a 100K-item feed, which is (probably) not the common use case, but it is not that far-fetched either.
HTH
Gal.
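To illustrate why a StAX-based parser avoids the memory problem described above, here is a minimal sketch that reads a feed as a stream of events instead of building a JDOM tree. It uses the standard `javax.xml.stream` API; the class `StaxRssSketch` and the method `itemLinks` are hypothetical names, and this is not Gal's implementation, which was not posted in this thread.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxRssSketch {
    // Collects the <link> of every <item> in an RSS document.
    // Only the current event is held in memory, never the whole tree.
    static List<String> itemLinks(Reader feed) {
        List<String> links = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(feed);
            boolean inItem = false;
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    if (event == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if ("item".equals(name)) {
                            inItem = true;
                        } else if (inItem && "link".equals(name)) {
                            // getElementText consumes up to the matching end tag.
                            links.add(reader.getElementText().trim());
                        }
                    } else if (event == XMLStreamConstants.END_ELEMENT
                            && "item".equals(reader.getLocalName())) {
                        inItem = false;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("malformed feed", e);
        }
        return links;
    }
}
```

For a very large feed, a real parser would hand each item to the caller as soon as its end tag is read rather than collecting results in a list; the list here only keeps the example short.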
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471754 ]
nutch.newbie commented on NUTCH-443:
Gal: Thanks for the feedback and the testing you have done. If Nutch is going to be the open source version of Google, then maybe we should consider StAX. Could you please provide some information about your implementation, probably on the mailing list? My use case is going to be a lot more than a 100K-item feed, so I am interested to know more.
I would like to hear others' views on feedparser, please, apart from the Apache politics :-) The big question is: can anyone use Nutch to build a Technorati or Bloglines using feedparser? It seems not.
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471780 ]
Chris A. Mattmann commented on NUTCH-443:
Nutch Newbie,
What exactly do you mean when you mention Apache politics? Feedparser wasn't selected because it was an Apache sub-project. In fact, that's as far from the truth as possible. I selected feedparser at the time (in May 2005 or so) because it was the only one of the three RSS reading APIs (informa, feedparser and rome) that I could figure out. The time it took me just to understand rome and informa was far greater than the time it took me to write the entire RSS parser using feedparser.
That said, things may have changed in the past year and a half. Perhaps Rome provides an easier API than feedparser now. Perhaps informa is faster. I'm not exactly sure what the answers to these and other questions on this subject are. However, before anything is said about feedparser, it's only fair that the folks who wrote it get to chime in. For that matter, it would probably be a good idea to contact Kevin Burton, the lead developer of commons-feedparser, and ask him about its relationship to rome and other APIs such as StAX, or even informa.
Cheers,
Chris
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471806 ]
nutch.newbie commented on NUTCH-443:
Chris: Frankly, my comments are about feedparser, and I must say I am grateful for the rss plugin and the hard work you put in. You decided to go with feedparser because you thought it was the correct solution, so please don't take this personally.
According to SVN (http://svn.apache.org/viewvc/jakarta/commons/dormant/feedparser/trunk/), the last update to feedparser was 12 months ago, plus there is no Atom 1.0 support. This is how I would put it, and frankly it doesn't matter:
1. The goal of Nutch is to be an open source alternative to Google.
2. You can't have a dead-end feedparser as your fundamental feed parsing solution when the project has not moved for the last 12 months!
Well, go figure why people think it's Apache politics. Sorry for the outburst. On the one hand Nutch would like to preach that it is the alternative to Google, and on the other hand it uses technology that is no longer active. That's all.
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dogacan Güney updated NUTCH-443:
Attachment: NUTCH-443-draft-v3.patch
New patch, contains a possible fix for the CrawlDbReducer problem. This version finally works! (well, not really, but I can definitely say that it almost kind of works.. sometimes :)
I have two main issues with this patch:
1) If the fetcher is in parsing mode and parse returns a SUCCESS_REDIRECT, the fetcher handles this redirect. After this change, the fetcher checks whether the first element of parseMap.values() (whatever that may be) has a SUCCESS_REDIRECT. It is possible that a multi-entry parseMap has a parse element with a SUCCESS_REDIRECT that is not the first element. (Perhaps we can first check whether parseMap.get(originalUrl) returns a parse and, if not, use the first element of parseMap.values()?)
2) To be able to pass the fetch time to URLs that were not actually fetched but generated in parse, I first put the original fetch time into content and then pass the value in content to all elements of parseMap.values(). I guess this approach is not optimal, since it passes the fetch time around a lot.
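The lookup order suggested in point 1 can be sketched as follows. The class `RedirectLookupSketch`, its `Status` enum, and the method `primaryStatus` are hypothetical names for illustration; Nutch's own ParseStatus is richer, and the patch itself may resolve this differently.

```java
import java.util.Map;

final class RedirectLookupSketch {
    // Hypothetical, simplified status values.
    enum Status { SUCCESS, SUCCESS_REDIRECT }

    // Picks the status that should drive redirect handling:
    // prefer the entry for the originally fetched URL, and only
    // fall back to the first value in iteration order if it is absent.
    static Status primaryStatus(Map<String, Status> parseMap, String originalUrl) {
        Status status = parseMap.get(originalUrl);
        if (status != null) {
            return status;
        }
        return parseMap.isEmpty() ? null : parseMap.values().iterator().next();
    }
}
```

The fallback is only well defined when the map has a predictable iteration order (for example a LinkedHashMap); with a plain HashMap, "the first element" is arbitrary, which is the concern raised in the comment.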
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471857 ]
Dogacan Güney commented on NUTCH-443:
nutch.newbie: I fail to see what the problem is. If feedparser doesn't work for you, Nutch has a very powerful plugin API. Just write another plugin that uses Rome or whatever. If you are willing to share it, post it to JIRA, explaining why your plugin is better than the current one. Unless there is a license-related problem, I am sure that the Nutch developers will put it in.
PS: I actually have a half-baked plugin that uses Rome, and I will work on RSS index and RSS query plugins once this issue is resolved.
[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Renaud Richardet updated NUTCH-443:
Attachment: NUTCH-443-draft-v4.patch
Hi Dogacan,
Thanks for merging the patches, good teamwork! I worked on the RSS parser; it should now basically work. All core and plugin tests now pass, except for TestRSSparser; I will work on that. Once this is in place, I will have a look at the other issues with fetch time, etc. I merged my changes with your patch, version 3.
[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471878 ]
Renaud Richardet commented on NUTCH-443:
Nutch Newbie, Gal, Chris,
It's great that you are discussing alternative RSS parsing libraries, but the resolution of this issue does not depend on which underlying RSS library is used in RSSParser. Would you mind moving the conversation to the new issue I created for it (NUTCH-444)? Thanks a bunch.
[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471952 ]
nutch.newbie commented on NUTCH-444:
Renaud: Thanks for moving the discussion here. First, to answer your question: yes, it is a MIME type detection problem. The goal of the trial was to see if one could build a feed-only search site, i.e. just feeds, but I didn't succeed. I will give it another go over the weekend.
Dogacan: Yes, one could just replace feedparser with Rome or StAX and submit it back here or use it internally. My point was to see how others feel about it; maybe there are others who have run into problems and can share their experience. Gal pointed out his experience with Rome (at least it is being further developed) and StAX, and you pointed out that you are doing something with Rome. I just wanted to know what others think and what their experience is, that's all.
Yes, you are correct, I posted it in the wrong place, NUTCH-443. But NUTCH-443 started off as someone having trouble with RSS, and in my view it is important to discuss this, since the library we are using (feedparser) is not going to solve the original problem if one tries to create a pure RSS search engine. With a library that did, NUTCH-443 would not have surfaced in the first place. I am looking forward to the day when I can use Nutch as just an RSS feed search engine, so Dogacan, I am very interested in your Rome implementation. Maybe you can post the code here so that I can participate.
Possibly use a different library to parse RSS feed for improved performance and compatibility
Key: NUTCH-444
URL: https://issues.apache.org/jira/browse/NUTCH-444
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Priority: Minor
Fix For: 0.9.0
As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
- OutOfMemory when parsing 100k feeds, since it has to convert the feed to jdom first
- no support for Atom 1.0
- there has been no development in the last year
Alternatives are:
- Rome
- Informa
- custom implementation based on Stax
- ??