[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542819 ]

Renaud Richardet commented on NUTCH-444:

Hi, I am travelling and will be offline until January 2008. Thanks for your patience.

Renaud
renaud at oslutions dot com
www.oslutions.com

Possibly use a different library to parse RSS feed for improved performance and compatibility
Key: NUTCH-444
URL: https://issues.apache.org/jira/browse/NUTCH-444
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Reporter: Renaud Richardet
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.0.0
Attachments: feed.tar.bz2, NUTCH-444.1-1.patch, NUTCH-444.Mattmann.061707.patch.txt, NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2

As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current library (feedparser) has the following issues:
- OutOfMemory when parsing 100k feeds, since it has to convert the feed to JDOM first
- no support for Atom 1.0
- there has been no development in the last year

Alternatives are:
- Rome
- Informa
- a custom implementation based on StAX
- ??

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
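To illustrate the "custom implementation based on StAX" alternative: unlike a JDOM-based parse, a StAX pull parser streams through the feed and keeps only the current element in memory, which is why it avoids the OutOfMemory problem with large feeds. The sketch below is not Nutch code; it uses only the JDK's `javax.xml.stream` API, and the class and method names are hypothetical.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Sketch: pull item/entry titles out of an RSS or Atom feed with StAX,
// never materializing the whole document as a tree in memory.
public class StaxFeedTitles {
    public static List<String> itemTitles(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            boolean inItem = false, inTitle = false;
            StringBuilder text = new StringBuilder();
            while (r.hasNext()) {
                switch (r.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        String name = r.getLocalName();
                        // "item" covers RSS, "entry" covers Atom
                        if (name.equals("item") || name.equals("entry")) inItem = true;
                        else if (inItem && name.equals("title")) { inTitle = true; text.setLength(0); }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (inTitle) text.append(r.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        String end = r.getLocalName();
                        if (inTitle && end.equals("title")) { inTitle = false; titles.add(text.toString()); }
                        else if (end.equals("item") || end.equals("entry")) inItem = false;
                        break;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String rss = "<rss><channel><item><title>Hello</title></item>"
                   + "<item><title>World</title></item></channel></rss>";
        System.out.println(itemTitles(rss)); // [Hello, World]
    }
}
```

A real parser would of course also extract links, dates, and descriptions, but the same pattern applies: react to start/end events instead of building a tree.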
[jira] Issue Comment Edited: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542814 ]

musepwizard edited comment on NUTCH-444 at 11/15/07 9:08 AM:

Fixes errors in the unit test on Windows machines. This is a trivial change, so I went ahead and committed it.

was (Author: musepwizard): Fixes errors in the unit test on Windows machines.
[jira] Closed: (NUTCH-552) Upgrade Nutch to Hadoop 0.15.x
[ https://issues.apache.org/jira/browse/NUTCH-552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes closed NUTCH-552.
Resolution: Fixed

This has now been fixed and committed.

Upgrade Nutch to Hadoop 0.15.x
Key: NUTCH-552
URL: https://issues.apache.org/jira/browse/NUTCH-552
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Andrzej Bialecki
Assignee: Dennis Kubes
Fix For: 1.0.0
Attachments: NUTCH-552-1.patch, NUTCH-552-2.patch, NUTCH-552-3.patch, NUTCH-552-4.patch, NUTCH-552.1-1.patch

Upgrade Nutch to Hadoop 0.15.x.
[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-444:
Attachment: NUTCH-444.1-1.patch

Fixes errors in the unit test on Windows machines.
Commit Times for Issues
So I have been talking with some of the other committers, and I wanted to lay out a suggestion for standardizing some of the Nutch committer workflow processes, in the hope of speeding up Nutch development. The first one I was hoping to tackle is time to commit. At least for me, it has been hard to know when to commit something, especially when it was trivial or no one commented on the issue. Here is what is being proposed:

- Trivial changes: immediate, at the discretion of the committer
- Minor changes: 24 hours from the latest patch, or 1 or more +1s from committers
- Major and blocker changes: 4 days from the latest patch, or 2 or more +1s from committers

This way, if an issue has been active for some time but no one has taken a look at it, and it has passed all unit tests, we can go ahead and commit it. This should also allow more of the smaller changes to be handled faster. These are of course just suggestions; I would love to hear from others in the community. What I think would be best is to come to a consensus on this and then have a wiki page describing this and other processes for committers.

Dennis Kubes
Re: Commit Times for Issues
Dennis Kubes wrote:
> Here is what is being proposed:
> Trivial changes = immediate, this at the discretion of the committers
> Minor changes = 24 hours from latest patch or 1 or more +1 from committers
> Major and blocker changes = 4 days from latest patch or 2 or more +1 from committers

I agree with the overall plan - we need to speed up the process and release the committers from worrying too much about whether a patch is ripe enough to commit. Though I think that in the case of minor changes, the 24-hour period is too short. By definition, since they are not trivial, they could use a peer review. Sometimes it's difficult to get a patch reviewed within 24 hours, and in the coding enthusiasm it's easy to be too quick ... I'd say 48 hours if no review, or less if the patch is reviewed and gets a +1.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
Nutch trunk js-parser problem with extremely long and meaningless Elements
I've run into this problem before: while running the parser, it gets caught in really deep regex loops. As a quick fix, I changed urlfilter-prefix to disallow URLs over 300 characters and to make sure none of the characters have ASCII values below 32 (control characters). I just ran into another one today, but this time it's in the JS parser. Take a look at the source for http://www.magic-cadeaux.fr/ where it defines the function swap(image, num). If it weren't for all of the slashes, it would be well-formed JavaScript, but unfortunately the parse-js plugin doesn't deal with it correctly; it just hangs in a very deep loop. A browser such as Firefox, however, seems to handle it fine. Is there a way we can deal with these cases other than limiting the size of an Element?
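The quick-fix workaround described above can be sketched as a simple predicate. This is not the actual urlfilter-prefix plugin code; the class name, the 300-character cutoff, and the "below 32" control-character check are taken from the description in the post.

```java
// Hypothetical sketch of the workaround: reject URLs longer than 300
// characters or containing ASCII control characters (code points < 32),
// so pathological inputs never reach the regex-heavy parsers.
public class UrlSanityFilter {
    static final int MAX_LENGTH = 300; // cutoff mentioned in the post

    public static boolean accept(String url) {
        if (url == null || url.length() > MAX_LENGTH) return false;
        for (int i = 0; i < url.length(); i++) {
            if (url.charAt(i) < 32) return false; // control character
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/index.html")); // true
        System.out.println(accept("http://example.com/bad\u0001path")); // false
    }
}
```

Note this only guards the URL side; it does nothing for hostile page content like the JavaScript case above, which is why a parser-side timeout or depth limit would still be needed.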
About Heritrix crawl: can anyone in this Nutch forum help me? Thanks
Section A.3, "Mirroring .html Files Only", in http://crawler.archive.org/articles/user_manual/usecases.html says:

On the Settings screen, set the following for the NotMatchesFilePatternDecideRule:
- decision: REJECT
- use-preset-pattern: CUSTOM
- regexp: .*(/|\.html)$

How do I configure the above in the Submodules section of Heritrix? I don't know how; can anyone help me? Thanks.
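For what the rule's pattern actually does: `.*(/|\.html)$` matches URIs that end in "/" or ".html", so with decision REJECT on a NotMatchesFilePatternDecideRule, everything else is rejected. The rule itself is configured through the Heritrix UI, but the pattern can be checked with plain `java.util.regex` (the class name below is made up for illustration):

```java
import java.util.regex.Pattern;

// Demonstrates the regexp from the Heritrix "Mirroring .html Files Only"
// use case: accept only URIs ending in "/" or ".html".
public class HtmlOnlyPattern {
    static final Pattern HTML_ONLY = Pattern.compile(".*(/|\\.html)$");

    public static void main(String[] args) {
        System.out.println(HTML_ONLY.matcher("http://example.com/").matches());          // true
        System.out.println(HTML_ONLY.matcher("http://example.com/page.html").matches()); // true
        System.out.println(HTML_ONLY.matcher("http://example.com/image.jpg").matches()); // false
    }
}
```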