Re: Issues pending before 0.9 release
rubdabadub wrote:
> On 3/22/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> rubdabadub wrote:
>>> Hi:
>>> Just wondering about NUTCH-61
>>> http://issues.apache.org/jira/browse/Nutch-61
>>> Will it make the 0.9 cut?
>>> It would be nice if it did. It's probably too late.
>> This was discussed before - it will be applied right after the release.
> Hello Andrzej: Please provide some kind love to Nutch-61 :-)

:) Yes, that's the next thing I'm going to do whenever I get some free time.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: Issues pending before 0.9 release
On 3/22/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> rubdabadub wrote:
>> Hi:
>> Just wondering about NUTCH-61
>> http://issues.apache.org/jira/browse/Nutch-61
>> Will it make the 0.9 cut?
>> It would be nice if it did. It's probably too late.
> This was discussed before - it will be applied right after the release.

Hello Andrzej: Please provide some kind love to Nutch-61 :-) It would be very useful. Thank you for your kind attention.

Regards
Rajesh.
Re: Issues pending before 0.9 release
I worked through this SWF issue a little more and it seems that Java 6 parses out the content differently than Java 5. My guess is that it is some type of collection change from 5 to 6, because it looks like only the ordering of the elements is different.

Dennis Kubes

Sample

Help javascript:openCrosslinkWindow('/go/adobeacquisition') Macromedia Home /go/gnav_search?loc=en_us MovieClip solutions /go/gnav_showcase _sans rollOut To ensure the best possible Internet Experience, please download the latest version of the free /go/gnav_store International Products devnet en_us /go/gnav_products AppleGothic Macromedia Flash Player active products String Store downloads rollOver Adobe Home /go/gnav_your_account /go/gnav_downloads Showcase bluePill /go/gnav_company /go/gnav_support /go/gnav_help javascript:openCrosslinkWindow('/go/gnav_adobe_home') home Home Array /go/gnav_fl_minmessage textColor Developers Support color support showcase button /go/gnav_mm_home tabHolder selected Solutions LocaleManager Verdana /go/gnav_devnet Acquisition Info /go/gnav_cart Company /go/gnav_solutions company Downloads TextFormat

Java 6

tabHolder LocaleManager Downloads /go/gnav_mm_home AppleGothic downloads MovieClip Acquisition Info rollOut _sans Home String active Macromedia Home Store /go/gnav_company /go/gnav_products color javascript:openCrosslinkWindow('/go/gnav_adobe_home') Adobe Home button support home javascript:openCrosslinkWindow('/go/adobeacquisition') products /go/gnav_store /go/gnav_your_account Help selected /go/gnav_help bluePill Macromedia Flash Player Array en_us Solutions International /go/gnav_solutions TextFormat /go/gnav_search?loc=en_us Company /go/gnav_showcase To ensure the best possible Internet Experience, please download the latest version of the free /go/gnav_cart /go/gnav_devnet rollOver textColor devnet /go/gnav_support Products solutions Developers Verdana Showcase /go/gnav_fl_minmessage company /go/gnav_downloads Support showcase

Andrzej Bialecki wrote:
> Dennis Kubes wrote:
>> I did an update, clean, and test and got no errors.
>> BUILD SUCCESSFUL
>> Total time: 6 minutes
> It seems this is related to JDK 1.6 - when I switched back to 1.5 all tests passed successfully, switching again to 1.6 causes the parse-swf test to fail. I'm not sure what the reason is - it seems that the results of text extraction are completely different under 1.6 ...
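A quick way to check the guess that only the ordering changed is to compare the two dumps as multisets of tokens. The sketch below is an illustration only: it is written in Python rather than Java, the function names are made up for this example, and it assumes the extracted text has already been split into one token per list entry (how parse-swf actually delimits its output is not shown in this thread).

```python
from collections import Counter


def same_up_to_order(tokens_a, tokens_b):
    """Return True if both token lists hold the same tokens the same
    number of times, regardless of the order they appear in."""
    return Counter(tokens_a) == Counter(tokens_b)


def token_diff(tokens_a, tokens_b):
    """Return two dicts: tokens (with counts) that appear only in, or more
    often in, the first list, and the same for the second list."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    # Counter subtraction keeps only positive counts.
    return dict(count_a - count_b), dict(count_b - count_a)


if __name__ == "__main__":
    # Hypothetical, shortened stand-ins for the two dumps above.
    first_dump = ["Help", "Macromedia Home", "MovieClip", "Store"]
    second_dump = ["Store", "MovieClip", "Help", "Macromedia Home"]
    print(same_up_to_order(first_dump, second_dump))
    print(token_diff(first_dump, second_dump))
```

If the multisets match, an order-insensitive comparison in the test would be one way to make it pass on both JDKs; whether that is the right fix for parse-swf is a separate question.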
Re: Issues pending before 0.9 release
Sami Siren wrote:
>> Let's make it the best release ever! :)
> I have a good feeling about this one. There's some nice marketing material about crawling efficiency [1]. I should probably extend benching to indexing and searching too.
> [1] http://blog.foofactory.fi/2007/03/twice-speed-half-size.html

Yes, I saw this - great stuff :)

-- 
Best regards,
Andrzej Bialecki
Re: Issues pending before 0.9 release
Andrzej Bialecki wrote:
> I can't figure out what's wrong with the SWF parser when used with JDK
> 1.6, it works just fine with 1.5 .. However, I propose to add a release
> note somewhere that warns about this, and move on with the release anyway.

+1 A jira issue is probably enough.

> If there are no further issues (anyone?), we could start the release
> process on Monday, and until then run as many tests as possible.

+1

> Let's make it the best release ever! :)

I have a good feeling about this one. There's some nice marketing material about crawling efficiency [1]. I should probably extend benching to indexing and searching too.

[1] http://blog.foofactory.fi/2007/03/twice-speed-half-size.html

-- Sami Siren
Re: Issues pending before 0.9 release
Dennis Kubes wrote:
> I did an update, clean, and test and got no errors.
> BUILD SUCCESSFUL
> Total time: 6 minutes

I can't figure out what's wrong with the SWF parser when used with JDK 1.6; it works just fine with 1.5. However, I propose to add a release note somewhere that warns about this, and move on with the release anyway.

I upgraded to Hadoop 0.12.2, since it contained some important stability fixes. All tests pass.

If there are no further issues (anyone?), we could start the release process on Monday, and until then run as many tests as possible.

Let's make it the best release ever! :)

-- 
Best regards,
Andrzej Bialecki
Re: Issues pending before 0.9 release
Dennis Kubes wrote:
> I did an update, clean, and test and got no errors.
> BUILD SUCCESSFUL
> Total time: 6 minutes

It seems this is related to JDK 1.6 - when I switched back to 1.5 all tests passed successfully, switching again to 1.6 causes the parse-swf test to fail. I'm not sure what the reason is - it seems that the results of text extraction are completely different under 1.6 ...

-- 
Best regards,
Andrzej Bialecki
Re: Issues pending before 0.9 release
rubdabadub wrote:
> Hi:
> Just wondering about NUTCH-61
> http://issues.apache.org/jira/browse/Nutch-61
> Will it make the 0.9 cut?
> It would be nice if it did. It's probably too late.

This was discussed before - it will be applied right after the release.

-- 
Best regards,
Andrzej Bialecki
Re: Issues pending before 0.9 release
I did an update, clean, and test and got no errors.

BUILD SUCCESSFUL
Total time: 6 minutes

Sami Siren wrote:
> 2007/3/21, Andrzej Bialecki <[EMAIL PROTECTED]>:
>> Sami Siren wrote:
>>> for me it works:
>>> ...
>>> BUILD SUCCESSFUL
>>> Total time: 4 minutes 3 seconds
>> I did a fresh checkout to an empty dir, rebuilt and it's still failing - perhaps you have some uncommitted changes in your working copy ... ?
> no, I also did a fresh co from trunk, I'll check it again this evening just in case.
> -- Sami Siren
Re: Issues pending before 0.9 release
2007/3/21, Andrzej Bialecki <[EMAIL PROTECTED]>: Sami Siren wrote: > for me it works: > > ... > BUILD SUCCESSFUL > Total time: 4 minutes 3 seconds I did a fresh checkout to an empty dir, rebuilt and it's still failing - perhaps you have some uncommitted changes in your working copy ... ? no, I also did a fresh co from trunk, I'll check it again this evening just in case. -- Sami Siren
Re: Issues pending before 0.9 release
Sami Siren wrote:
> for me it works:
> ...
> BUILD SUCCESSFUL
> Total time: 4 minutes 3 seconds

I did a fresh checkout to an empty dir, rebuilt and it's still failing - perhaps you have some uncommitted changes in your working copy ... ?

-- 
Best regards,
Andrzej Bialecki
Re: Issues pending before 0.9 release
for me it works: ... BUILD SUCCESSFUL Total time: 4 minutes 3 seconds -- Sami Siren 2007/3/21, Andrzej Bialecki <[EMAIL PROTECTED]>: Dennis Kubes wrote: > I am good to go as well. Hmm ... Test suite fails for me, with a cryptic message (cryptic because the plugin test itself succeeds): [...] init: init-plugin: deps-jar: compile: [echo] Compiling plugin: urlnormalizer-regex compile-test: jar: deps-test: init: init-plugin: compile: jar: deps-test: deploy: copy-generated-lib: deploy: copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-regex [junit] Running org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.016 sec [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 8.359 sec BUILD FAILED C:\disks\e\work\nutch\vanilla\build.xml:300: The following error occurred while executing this line: C:\disks\e\work\nutch\vanilla\src\plugin\build.xml:99: The following error occurred while executing this line: C:\disks\e\work\nutch\vanilla\src\plugin\build-plugin.xml:200: Tests failed! -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Issues pending before 0.9 release
2007/3/21, Andrzej Bialecki <[EMAIL PROTECTED]>: >> Any other stuff we need to fix before the release? > > I am satisfied except the broken bin/nutch. Fixed now - tested both under Cygwin and Fedora. Thanks, I can confirm that it works now :) -- Sami Siren
Re: Issues pending before 0.9 release
Dennis Kubes wrote:
> I am good to go as well.

Hmm ... Test suite fails for me, with a cryptic message (cryptic because the plugin test itself succeeds):

[...]
init:
init-plugin:
deps-jar:
compile:
     [echo] Compiling plugin: urlnormalizer-regex
compile-test:
jar:
deps-test:
init:
init-plugin:
compile:
jar:
deps-test:
deploy:
copy-generated-lib:
deploy:
copy-generated-lib:
test:
     [echo] Testing plugin: urlnormalizer-regex
    [junit] Running org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer
    [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.016 sec
    [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 8.359 sec

BUILD FAILED
C:\disks\e\work\nutch\vanilla\build.xml:300: The following error occurred while executing this line:
C:\disks\e\work\nutch\vanilla\src\plugin\build.xml:99: The following error occurred while executing this line:
C:\disks\e\work\nutch\vanilla\src\plugin\build-plugin.xml:200: Tests failed!

-- 
Best regards,
Andrzej Bialecki
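Since the failing suite is not the one shown at the end of the log, it can help to scan the full Ant output for any [junit] summary line with a non-zero failure or error count. What follows is a rough sketch, not part of the Nutch build: it assumes the complete log text is available as a string and uses the same "[junit] Running ..." and "[junit] Tests run: ..." line shapes quoted above. The function name is made up for this example, and Python is used only for brevity.

```python
import re

# Line shapes taken from the Ant log quoted in this thread.
RUNNING = re.compile(r"\[junit\]\s+Running\s+(\S+)")
SUMMARY = re.compile(
    r"\[junit\]\s+Tests run:\s*(\d+),\s*Failures:\s*(\d+),\s*Errors:\s*(\d+)"
)


def failing_suites(log_text):
    """Return (suite, failures, errors) for every [junit] summary line
    that reports at least one failure or error."""
    current = None  # suite named by the most recent "Running" line
    failing = []
    for line in log_text.splitlines():
        match = RUNNING.search(line)
        if match:
            current = match.group(1)
            continue
        match = SUMMARY.search(line)
        if match:
            failures, errors = int(match.group(2)), int(match.group(3))
            if failures or errors:
                failing.append((current or "<unknown suite>", failures, errors))
            # A summary with no preceding "Running" line is reported as unknown.
            current = None
    return failing
```

Run against the excerpt above this returns an empty list, which matches the observation that the urlnormalizer-regex test itself succeeds and the real failure must be earlier in the output.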
Re: Issues pending before 0.9 release
Hi: Just wondering about NUTCH-61 http://issues.apache.org/jira/browse/Nutch-61 Will it make the 0.9 cut? It would be nice if it did. Its probably too late. Regards On 3/21/07, Dennis Kubes <[EMAIL PROTECTED]> wrote: I am good to go as well. Dennis Kubes Andrzej Bialecki wrote: > Sami Siren wrote: >> Andrzej Bialecki wrote: >>> Hi all, >>> >>> I just committed Hadoop 0.12.1. Let's double-check that it works ok. >>> Here's the list of Critical/Blocker issues I mentioned before, and their >>> current status: >>> >>> Any other stuff we need to fix before the release? >> >> I am satisfied except the broken bin/nutch. > > Fixed now - tested both under Cygwin and Fedora. >
Re: Issues pending before 0.9 release
I am good to go as well. Dennis Kubes Andrzej Bialecki wrote: Sami Siren wrote: Andrzej Bialecki wrote: Hi all, I just committed Hadoop 0.12.1. Let's double-check that it works ok. Here's the list of Critical/Blocker issues I mentioned before, and their current status: Any other stuff we need to fix before the release? I am satisfied except the broken bin/nutch. Fixed now - tested both under Cygwin and Fedora.
Re: Issues pending before 0.9 release
Sami Siren wrote: Andrzej Bialecki wrote: Hi all, I just committed Hadoop 0.12.1. Let's double-check that it works ok. Here's the list of Critical/Blocker issues I mentioned before, and their current status: Any other stuff we need to fix before the release? I am satisfied except the broken bin/nutch. Fixed now - tested both under Cygwin and Fedora. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Issues pending before 0.9 release
Andrzej Bialecki wrote: > Hi all, > > I just committed Hadoop 0.12.1. Let's double-check that it works ok. > Here's the list of Critical/Blocker issues I mentioned before, and their > current status: > > Any other stuff we need to fix before the release? I am satisfied except the broken bin/nutch. -- Sami Siren
Re: Issues pending before 0.9 release
Hi all,

I just committed Hadoop 0.12.1. Let's double-check that it works ok. Here's the list of Critical/Blocker issues I mentioned before, and their current status:

NUTCH-400 Fixed.
NUTCH-353 Moved to Major, fix after release.
NUTCH-233 Fixed.
NUTCH-436 Fixed.
NUTCH-427 Moved to Major, fix after release.
NUTCH-381 Won't fix - this is a configuration issue.
NUTCH-277 Cannot reproduce.
NUTCH-167 Fixed.

Any other stuff we need to fix before the release?

-- 
Best regards,
Andrzej Bialecki
Re: Issues pending before 0.9 release
Sami Siren wrote:
> It would be more beneficial to everybody if the discussions (related to release or Nutch) are held in public (hey this is open source!). The off the list stuff IMO smells.

+1

Folks sometimes wish to discuss project matters off-list to spare others the boring details, but this is usually a bad idea. All project decisions should be made in public on this list. Discussions relevant to these decisions are thus also best held on this list, since they explain the decision. Private discussions are permissible to develop a proposal, but that is usually better done on-list when possible, so that others can get involved earlier. (The one notable exception is that personnel issues are discussed on the private PMC list.)

Doug
Re: Issues pending before 0.9 release
> P.S. I am going to contact Pitor and coordinate with him: I'd like to be the release manager for this Nutch release.

It would be more beneficial to everybody if the discussions (related to release or Nutch) are held in public (hey this is open source!). The off the list stuff IMO smells.

-- Sami Siren
Re: Issues pending before 0.9 release
Chris Mattmann wrote: P.S. I am going to contact Pitor and coordinate with him: I'd like to be the release manager for this Nutch release. Everyone heard that? :) That's cool, thanks! -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Issues pending before 0.9 release
Chris Mattmann wrote: Hi Guys, Blocker * NUTCH-400 (Update & add missing license headers) - I believe this is fixed and should be closed +1, thanks to Sami for closing it. * NUTCH-353 (pages that serverside forwards will be refetched every time) - this was partially fixed in NUTCH-273, but a more complete solution would require significant changes to LinkDb. As there are no patches implementing this, I left it open, but it's no longer as critical as it was before. I propose to move it to "Major" and address it in the next release. +1 * NUTCH-233 (wrong regular expression hang reduce process for ever) - I propose to apply the fix provided by Sean Dean and close this issue for now. +1 Critical * NUTCH-436 (Incorrect handling of relative paths when the embedded URL path is empty). There is no patch available yet. If someone could contribute a patch I'd like to see this fixed before the release. Looks like Dennis is on this one * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's certainly not critical (as this is an optional new feature). I propose to change it to Major, and make a decision - do we want another plugin like parse-mp3 or parse-rtf, or not. Let's hold off on this: it's not necessary for 0.9, and I don't think there's been a bunch of traffic on the list identifying this as critical to get into the sources for the release * NUTCH-381 (Ignore external link not work as expected) - I'll try to reproduce it, and if I find an easy fix I'd like to apply it before the release. +1 * NUTCH-277 (Fetcher dies because of "max. redirects") - I wasn't able to reproduce it. If there is no updated information on this I propose to close it with "Can't reproduce". +1, I had to do something similar with NUTCH-258 * NUTCH-167 (Observation of ) - there's a patch which I tested in a limited production env. If there are no objections I'd like to apply it before the release. 
+1 Major = There are 84 major issues, but some of them are either invalid, or should be "minor", or no longer apply and should be closed. Please review them if you can and provide some comments or recommendations if you think you have some new information. I will spend some time going through JIRA today and see if there's any issues that I can find that: 1. Have a patch already 2. Sound like something quick, easy, and not so far-reaching across the entire Nutch API One decision also that we need to make is which version of Hadoop should be included in the release. Current trunk uses 0.10.1, I have a set of production-tested patches that use 0.11.2, and today the Hadoop team released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time before our release). The most conservative option is to stay with 0.10.1, but by the time people start using Nutch this will be a fairly old version already. I propose to upgrade to 0.11.2. We could use 0.12.1 - but in this case with the expectation that we release less than stable version of Nutch to be soon followed by a minor stable release ... I'd agree with the upgrade to 0.11.2, +1 Cheers, Chris P.S. I am going to contact Pitor and coordinate with him: I'd like to be the release manager for this Nutch release. I would like to help with this as well, even if it is just watching how the process works this time. Dennis
Re: Issues pending before 0.9 release
Hi Guys, > Blocker > > * NUTCH-400 (Update & add missing license headers) - I believe this is > fixed and should be closed +1, thanks to Sami for closing it. > > * NUTCH-353 (pages that serverside forwards will be refetched every > time) - this was partially fixed in NUTCH-273, but a more complete > solution would require significant changes to LinkDb. As there are no > patches implementing this, I left it open, but it's no longer as > critical as it was before. I propose to move it to "Major" and address > it in the next release. +1 > > * NUTCH-233 (wrong regular expression hang reduce process for ever) - I > propose to apply the fix provided by Sean Dean and close this issue for now. +1 > > Critical > > * NUTCH-436 (Incorrect handling of relative paths when the embedded URL > path is empty). There is no patch available yet. If someone could > contribute a patch I'd like to see this fixed before the release. Looks like Dennis is on this one > > * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's > certainly not critical (as this is an optional new feature). I propose > to change it to Major, and make a decision - do we want another plugin > like parse-mp3 or parse-rtf, or not. Let's hold off on this: it's not necessary for 0.9, and I don't think there's been a bunch of traffic on the list identifying this as critical to get into the sources for the release > > * NUTCH-381 (Ignore external link not work as expected) - I'll try to > reproduce it, and if I find an easy fix I'd like to apply it before the > release. +1 > > * NUTCH-277 (Fetcher dies because of "max. redirects") - I wasn't able > to reproduce it. If there is no updated information on this I propose to > close it with "Can't reproduce". +1, I had to do something similar with NUTCH-258 > > * NUTCH-167 (Observation of ) - > there's a patch which I tested in a limited production env. If there are > no objections I'd like to apply it before the release. 
+1 > > Major > = > There are 84 major issues, but some of them are either invalid, or > should be "minor", or no longer apply and should be closed. Please > review them if you can and provide some comments or recommendations if > you think you have some new information. I will spend some time going through JIRA today and see if there's any issues that I can find that: 1. Have a patch already 2. Sound like something quick, easy, and not so far-reaching across the entire Nutch API > > > One decision also that we need to make is which version of Hadoop should > be included in the release. Current trunk uses 0.10.1, I have a set of > production-tested patches that use 0.11.2, and today the Hadoop team > released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time > before our release). The most conservative option is to stay with > 0.10.1, but by the time people start using Nutch this will be a fairly > old version already. I propose to upgrade to 0.11.2. We could use 0.12.1 > - but in this case with the expectation that we release less than stable > version of Nutch to be soon followed by a minor stable release ... I'd agree with the upgrade to 0.11.2, +1 Cheers, Chris P.S. I am going to contact Pitor and coordinate with him: I'd like to be the release manager for this Nutch release.
Re: Issues pending before 0.9 release
NUTCH-436 has a patch now if we want to add that to this release. Dennis Kubes Andrzej Bialecki wrote: Sean Dean wrote: As for which Hadoop version is included in the next Nutch release, I share the same concern as Sami with 0.10.1 as it NPE's on anything above 100-200k URLs. I can volunteer to test any other version we are interested in, my regular fetches are about 13 million URLs and take a couple days to complete. If anyone has a specific Hadoop jar they would like to share I don't mind testing it, otherwise I can just build the "most popular" version from source and replace that with my current one. For the record, I've been using Hadoop 0.9.1 for the longest time without any problems on these somewhat large crawls. It's clear to me then that we should bring Nutch to 0.11.2 first anyway. Then, if we have time and if you are willing, we could test the 0.12 and if it's stable enough for your 13 mln crawl then it's likely it's good enough for the rest of us. If there are no dissenting votes, I'll apply the patch to bring in 0.11.2 some time tomorrow. I will also create a JIRA issue and attach the patches from that revision to Hadoop 0.12 so that folks may test them. Thanks for your comments!
Re: Issues pending before 0.9 release
Sean Dean wrote: As for which Hadoop version is included in the next Nutch release, I share the same concern as Sami with 0.10.1 as it NPE's on anything above 100-200k URLs. I can volunteer to test any other version we are interested in, my regular fetches are about 13 million URLs and take a couple days to complete. If anyone has a specific Hadoop jar they would like to share I don't mind testing it, otherwise I can just build the "most popular" version from source and replace that with my current one. For the record, I've been using Hadoop 0.9.1 for the longest time without any problems on these somewhat large crawls. It's clear to me then that we should bring Nutch to 0.11.2 first anyway. Then, if we have time and if you are willing, we could test the 0.12 and if it's stable enough for your 13 mln crawl then it's likely it's good enough for the rest of us. If there are no dissenting votes, I'll apply the patch to bring in 0.11.2 some time tomorrow. I will also create a JIRA issue and attach the patches from that revision to Hadoop 0.12 so that folks may test them. Thanks for your comments! -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com
Re: Issues pending before 0.9 release
As for which Hadoop version is included in the next Nutch release, I share the same concern as Sami with 0.10.1 as it NPE's on anything above 100-200k URLs. I can volunteer to test any other version we are interested in, my regular fetches are about 13 million URLs and take a couple days to complete. If anyone has a specific Hadoop jar they would like to share I don't mind testing it, otherwise I can just build the "most popular" version from source and replace that with my current one. For the record, I've been using Hadoop 0.9.1 for the longest time without any problems on these somewhat large crawls. - Original Message From: Sami Siren <[EMAIL PROTECTED]> To: nutch-dev@lucene.apache.org Sent: Sunday, March 4, 2007 1:50:23 AM Subject: Re: Issues pending before 0.9 release Andrzej Bialecki wrote: > Hi all, > > The following issues need to be discussed and appropriate action taken > before the 0.9 release: > > Blocker > > * NUTCH-400 (Update & add missing license headers) - I believe this is > fixed and should be closed I agree. I should close it. > * NUTCH-233 (wrong regular expression hang reduce process for ever) - I > propose to apply the fix provided by Sean Dean and close this issue for > now. yes that was the resolution also last time :) > * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's > certainly not critical (as this is an optional new feature). I propose > to change it to Major, and make a decision - do we want another plugin > like parse-mp3 or parse-rtf, or not. One option would be setting up a separate project outside Apache to host and maintain these and remove the remaining torsos from Nutch source base. > One decision also that we need to make is which version of Hadoop should > be included in the release. Current trunk uses 0.10.1, I have a set of > production-tested patches that use 0.11.2, and today the Hadoop team > released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time > before our release). 
> The most conservative option is to stay with 0.10.1, but by the time people start using Nutch this will be a fairly

0.10.1 is not an option: there is that NPE in sorting that does not allow any crawling beyond modest sizes (HADOOP-917). We should upgrade Hadoop to 0.11.2 or 0.12.0 and gather experience from running it on reasonably sized crawls, so my suggestion is that we don't decide this on paper.

-- Sami Siren
Re: Issues pending before 0.9 release
Andrzej Bialecki wrote:
> Hi all,
>
> The following issues need to be discussed and appropriate action taken
> before the 0.9 release:
>
> Blocker
>
> * NUTCH-400 (Update & add missing license headers) - I believe this is
> fixed and should be closed

I agree. I should close it.

> * NUTCH-233 (wrong regular expression hang reduce process for ever) - I
> propose to apply the fix provided by Sean Dean and close this issue for
> now.

yes that was the resolution also last time :)

> * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's
> certainly not critical (as this is an optional new feature). I propose
> to change it to Major, and make a decision - do we want another plugin
> like parse-mp3 or parse-rtf, or not.

One option would be setting up a separate project outside Apache to host and maintain these and remove the remaining torsos from Nutch source base.

> One decision also that we need to make is which version of Hadoop should
> be included in the release. Current trunk uses 0.10.1, I have a set of
> production-tested patches that use 0.11.2, and today the Hadoop team
> released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time
> before our release). The most conservative option is to stay with
> 0.10.1, but by the time people start using Nutch this will be a fairly

0.10.1 is not an option: there is that NPE in sorting that does not allow any crawling beyond modest sizes (HADOOP-917). We should upgrade Hadoop to 0.11.2 or 0.12.0 and gather experience from running it on reasonably sized crawls, so my suggestion is that we don't decide this on paper.

-- Sami Siren
Re: Issues pending before 0.9 release
> Hi all, > > The following issues need to be discussed and appropriate action taken > before the 0.9 release: > > Blocker > > * NUTCH-400 (Update & add missing license headers) - I believe this is > fixed and should be closed > > * NUTCH-353 (pages that serverside forwards will be refetched every > time) - this was partially fixed in NUTCH-273, but a more complete > solution would require significant changes to LinkDb. As there are no > patches implementing this, I left it open, but it's no longer as > critical as it was before. I propose to move it to "Major" and address > it in the next release. > > * NUTCH-233 (wrong regular expression hang reduce process for ever) - I > propose to apply the fix provided by Sean Dean and close this issue for > now. > > Critical > > * NUTCH-436 (Incorrect handling of relative paths when the embedded URL > path is empty). There is no patch available yet. If someone could > contribute a patch I'd like to see this fixed before the release. I am starting to take a look at this. I will try to get it fixed before we release. > > * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's > certainly not critical (as this is an optional new feature). I propose > to change it to Major, and make a decision - do we want another plugin > like parse-mp3 or parse-rtf, or not. > > * NUTCH-381 (Ignore external link not work as expected) - I'll try to > reproduce it, and if I find an easy fix I'd like to apply it before the > release. > > * NUTCH-277 (Fetcher dies because of "max. redirects") - I wasn't able > to reproduce it. If there is no updated information on this I propose to > close it with "Can't reproduce". > > * NUTCH-167 (Observation of ) - > there's a patch which I tested in a limited production env. If there are > no objections I'd like to apply it before the release. > > Major > = > There are 84 major issues, but some of them are either invalid, or > should be "minor", or no longer apply and should be closed. 
> Please review them if you can and provide some comments or recommendations if
> you think you have some new information.
>
> One decision also that we need to make is which version of Hadoop should
> be included in the release. Current trunk uses 0.10.1, I have a set of
> production-tested patches that use 0.11.2, and today the Hadoop team
> released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time
> before our release). The most conservative option is to stay with
> 0.10.1, but by the time people start using Nutch this will be a fairly
> old version already. I propose to upgrade to 0.11.2. We could use 0.12.1
> - but in this case with the expectation that we release less than stable
> version of Nutch to be soon followed by a minor stable release ...

+1 for using 0.11.2. I looked through the release notes for 0.12 and there were some niceties such as HADOOP-432 for undeletes and a lot of bug fixes, but it didn't look like there were any critical issues as far as Nutch is concerned.

Dennis Kubes

> --
> Best regards,
> Andrzej Bialecki
Issues pending before 0.9 release
Hi all,

The following issues need to be discussed and appropriate action taken before the 0.9 release:

Blocker

* NUTCH-400 (Update & add missing license headers) - I believe this is fixed and should be closed

* NUTCH-353 (pages that serverside forwards will be refetched every time) - this was partially fixed in NUTCH-273, but a more complete solution would require significant changes to LinkDb. As there are no patches implementing this, I left it open, but it's no longer as critical as it was before. I propose to move it to "Major" and address it in the next release.

* NUTCH-233 (wrong regular expression hang reduce process for ever) - I propose to apply the fix provided by Sean Dean and close this issue for now.

Critical

* NUTCH-436 (Incorrect handling of relative paths when the embedded URL path is empty). There is no patch available yet. If someone could contribute a patch I'd like to see this fixed before the release.

* NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's certainly not critical (as this is an optional new feature). I propose to change it to Major, and make a decision - do we want another plugin like parse-mp3 or parse-rtf, or not.

* NUTCH-381 (Ignore external link not work as expected) - I'll try to reproduce it, and if I find an easy fix I'd like to apply it before the release.

* NUTCH-277 (Fetcher dies because of "max. redirects") - I wasn't able to reproduce it. If there is no updated information on this I propose to close it with "Can't reproduce".

* NUTCH-167 (Observation of ) - there's a patch which I tested in a limited production env. If there are no objections I'd like to apply it before the release.

Major

There are 84 major issues, but some of them are either invalid, or should be "minor", or no longer apply and should be closed. Please review them if you can and provide some comments or recommendations if you think you have some new information.

One decision also that we need to make is which version of Hadoop should be included in the release. Current trunk uses 0.10.1, I have a set of production-tested patches that use 0.11.2, and today the Hadoop team released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time before our release). The most conservative option is to stay with 0.10.1, but by the time people start using Nutch this will be a fairly old version already. I propose to upgrade to 0.11.2. We could use 0.12.1 - but in this case with the expectation that we release less than stable version of Nutch to be soon followed by a minor stable release ...

-- 
Best regards,
Andrzej Bialecki