Re: Issues pending before 0.9 release

2007-05-18 Thread Andrzej Bialecki

rubdabadub wrote:

On 3/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

rubdabadub wrote:
 Hi:

 Just wondering about NUTCH-61

 http://issues.apache.org/jira/browse/Nutch-61

 Will it make the 0.9 cut?

 It would be nice if it did. It's probably too late.

This was discussed before - it will be applied right after the release.


Hello Andrzej:

Please provide some kind love to Nutch-61 :-)


:) Yes, that's the next thing I'm going to do whenever I get some free time.


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Issues pending before 0.9 release

2007-05-16 Thread rubdabadub

On 3/22/07, Andrzej Bialecki [EMAIL PROTECTED] wrote:

rubdabadub wrote:
 Hi:

 Just wondering about NUTCH-61

 http://issues.apache.org/jira/browse/Nutch-61

 Will it make the 0.9 cut?

 It would be nice if it did. It's probably too late.

This was discussed before - it will be applied right after the release.


Hello Andrzej:

Please provide some kind love to Nutch-61 :-)

It would be very useful. Thank you for your kind attention.

Regards
Rajesh.


--
Best regards,
Andrzej Bialecki 




Re: Issues pending before 0.9 release

2007-03-25 Thread Dennis Kubes
I worked through this SWF issue a little more, and it seems that Java 6
parses out the content differently than Java 5. My guess is that it is
some type of collection change from 5 to 6, because it looks like only
the ordering of the elements is different.


Dennis Kubes

Sample
 Help javascript:openCrosslinkWindow('/go/adobeacquisition') 
Macromedia Home /go/gnav_search?loc=en_us MovieClip solutions 
/go/gnav_showcase _sans rollOut To ensure the best possible Internet 
Experience, please download the latest version of the free 
/go/gnav_store International Products devnet en_us /go/gnav_products 
AppleGothic Macromedia Flash Player active products String Store 
downloads rollOver Adobe Home /go/gnav_your_account /go/gnav_downloads 
Showcase bluePill /go/gnav_company /go/gnav_support /go/gnav_help 
javascript:openCrosslinkWindow('/go/gnav_adobe_home') home Home Array 
/go/gnav_fl_minmessage textColor Developers Support color support 
showcase button /go/gnav_mm_home tabHolder selected Solutions 
LocaleManager Verdana /go/gnav_devnet Acquisition Info /go/gnav_cart 
Company /go/gnav_solutions company Downloads TextFormat


Java 6
 tabHolder LocaleManager Downloads /go/gnav_mm_home AppleGothic 
downloads MovieClip Acquisition Info rollOut _sans Home String active 
Macromedia Home Store /go/gnav_company /go/gnav_products color 
javascript:openCrosslinkWindow('/go/gnav_adobe_home') Adobe Home button 
support home javascript:openCrosslinkWindow('/go/adobeacquisition') 
products /go/gnav_store /go/gnav_your_account Help selected 
/go/gnav_help bluePill Macromedia Flash Player Array en_us Solutions 
International /go/gnav_solutions TextFormat /go/gnav_search?loc=en_us 
Company /go/gnav_showcase To ensure the best possible Internet 
Experience, please download the latest version of the free /go/gnav_cart 
/go/gnav_devnet rollOver textColor devnet /go/gnav_support Products 
solutions Developers Verdana Showcase /go/gnav_fl_minmessage company 
/go/gnav_downloads Support showcase
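
If only the ordering of the extracted elements differs between the two samples above, a comparison that ignores order would give the same answer under both JDKs. Below is a minimal sketch in Java, assuming that whitespace-separated tokens are an acceptable unit of comparison (reordering whole elements leaves the multiset of words unchanged). The class and method names are hypothetical and are not part of the Nutch test suite.

```java
import java.util.HashMap;
import java.util.Map;

final class TokenBagCompare {

    /** Counts whitespace-separated tokens so that their order no longer matters. */
    static Map<String, Integer> tokenCounts(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** True when both texts contain the same tokens the same number of times. */
    static boolean sameTokens(String expected, String actual) {
        return tokenCounts(expected).equals(tokenCounts(actual));
    }

    public static void main(String[] args) {
        // Shortened, hypothetical excerpts in the spirit of the samples above.
        String first = "Help Macromedia Home MovieClip solutions";
        String second = "solutions MovieClip Help Macromedia Home";
        System.out.println(sameTokens(first, second));
    }
}
```

A check like this would hide a genuine ordering bug, so it only fits if the order of extracted text is not something the test is meant to guarantee.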



Andrzej Bialecki wrote:

Dennis Kubes wrote:

I did an update, clean, and test and got no errors.

BUILD SUCCESSFUL
Total time: 6 minutes


It seems this is related to JDK 1.6 - when I switched back to 1.5 all 
tests passed successfully, switching again to 1.6 causes the parse-swf 
test to fail. I'm not sure what is the reason - it seems that the 
results of text extraction are completely different under 1.6 ...




Re: Issues pending before 0.9 release

2007-03-24 Thread Andrzej Bialecki

Sami Siren wrote:



Let's make it the best release ever! :)


I have a good feeling about this one. There's some nice marketing
material about crawling efficiency [1]. I should probably extend
benching to indexing and searching too.

[1] http://blog.foofactory.fi/2007/03/twice-speed-half-size.html


Yes, I saw this - great stuff :)


--
Best regards,
Andrzej Bialecki 



Re: Issues pending before 0.9 release

2007-03-23 Thread Andrzej Bialecki

Dennis Kubes wrote:

I did an update, clean, and test and got no errors.

BUILD SUCCESSFUL
Total time: 6 minutes


I can't figure out what's wrong with the SWF parser when used with JDK
1.6; it works just fine with 1.5. However, I propose to add a release
note somewhere that warns about this, and move on with the release anyway.


I upgraded to Hadoop 0.12.2, since it contained some important stability 
fixes. All tests pass.


If there are no further issues (anyone?), we could start the release 
process on Monday, and until then run as many tests as possible.


Let's make it the best release ever! :)


--
Best regards,
Andrzej Bialecki 



Re: Issues pending before 0.9 release

2007-03-23 Thread Sami Siren
Andrzej Bialecki wrote:

 I can't figure out what's wrong with the SWF parser when used with JDK
 1.6; it works just fine with 1.5. However, I propose to add a release
 note somewhere that warns about this, and move on with the release anyway.

+1

A jira issue is probably enough.

 If there are no further issues (anyone?), we could start the release
 process on Monday, and until then run as many tests as possible.

+1

 Let's make it the best release ever! :)

I have a good feeling about this one. There's some nice marketing
material about crawling efficiency [1]. I should probably extend
benching to indexing and searching too.

[1] http://blog.foofactory.fi/2007/03/twice-speed-half-size.html

--
 Sami Siren


Re: Issues pending before 0.9 release

2007-03-21 Thread rubdabadub

Hi:

Just wondering about NUTCH-61

http://issues.apache.org/jira/browse/Nutch-61

Will it make the 0.9 cut?

It would be nice if it did. It's probably too late.

Regards

On 3/21/07, Dennis Kubes [EMAIL PROTECTED] wrote:

I am good to go as well.

Dennis Kubes

Andrzej Bialecki wrote:
 Sami Siren wrote:
 Andrzej Bialecki wrote:
 Hi all,

 I just committed Hadoop 0.12.1. Let's double-check that it works ok.
 Here's the list of Critical/Blocker issues I mentioned before, and their
 current status:

 Any other stuff we need to fix before the release?

 I am satisfied except the broken bin/nutch.

 Fixed now - tested both under Cygwin and Fedora.




Re: Issues pending before 0.9 release

2007-03-21 Thread Sami Siren

2007/3/21, Andrzej Bialecki [EMAIL PROTECTED]:


 Any other stuff we need to fix before the release?

 I am satisfied except the broken bin/nutch.

Fixed now - tested both under Cygwin and Fedora.

Thanks, I can confirm that it works now :)


--
Sami Siren


Re: Issues pending before 0.9 release

2007-03-21 Thread Sami Siren

for me it works:

...
BUILD SUCCESSFUL
Total time: 4 minutes 3 seconds

--
Sami Siren

2007/3/21, Andrzej Bialecki [EMAIL PROTECTED]:


Dennis Kubes wrote:
 I am good to go as well.

Hmm ... Test suite fails for me, with a cryptic message (cryptic because
the plugin test itself succeeds):

[...]
init:

init-plugin:

deps-jar:

compile:
  [echo] Compiling plugin: urlnormalizer-regex

compile-test:

jar:

deps-test:

init:

init-plugin:

compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:

copy-generated-lib:

test:
  [echo] Testing plugin: urlnormalizer-regex
 [junit] Running
org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer
 [junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.016 sec
 [junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 8.359 sec

BUILD FAILED
C:\disks\e\work\nutch\vanilla\build.xml:300: The following error
occurred while executing this line:
C:\disks\e\work\nutch\vanilla\src\plugin\build.xml:99: The following
error occurred while executing this line:
C:\disks\e\work\nutch\vanilla\src\plugin\build-plugin.xml:200: Tests
failed!



--
Best regards,
Andrzej Bialecki 




Re: Issues pending before 0.9 release

2007-03-21 Thread Andrzej Bialecki

Sami Siren wrote:

for me it works:

...
BUILD SUCCESSFUL
Total time: 4 minutes 3 seconds


I did a fresh checkout to an empty dir, rebuilt and it's still failing - 
perhaps you have some uncommitted changes in your working copy ... ?



--
Best regards,
Andrzej Bialecki 



Re: Issues pending before 0.9 release

2007-03-21 Thread Sami Siren

2007/3/21, Andrzej Bialecki [EMAIL PROTECTED]:


Sami Siren wrote:
 for me it works:

 ...
 BUILD SUCCESSFUL
 Total time: 4 minutes 3 seconds

I did a fresh checkout to an empty dir, rebuilt and it's still failing -
perhaps you have some uncommitted changes in your working copy ... ?



no, I also did a fresh co from trunk, I'll check it again this evening just
in case.

--
Sami Siren


Re: Issues pending before 0.9 release

2007-03-21 Thread Dennis Kubes

I did an update, clean, and test and got no errors.

BUILD SUCCESSFUL
Total time: 6 minutes

Sami Siren wrote:

2007/3/21, Andrzej Bialecki [EMAIL PROTECTED]:


Sami Siren wrote:
 for me it works:

 ...
 BUILD SUCCESSFUL
 Total time: 4 minutes 3 seconds

I did a fresh checkout to an empty dir, rebuilt and it's still failing -
perhaps you have some uncommitted changes in your working copy ... ?



no, I also did a fresh co from trunk, I'll check it again this evening just
in case.

--
Sami Siren



Re: Issues pending before 0.9 release

2007-03-21 Thread Andrzej Bialecki

rubdabadub wrote:

Hi:

Just wondering about NUTCH-61

http://issues.apache.org/jira/browse/Nutch-61

Will it make the 0.9 cut?

It would be nice if it did. It's probably too late.


This was discussed before - it will be applied right after the release.


--
Best regards,
Andrzej Bialecki 



Re: Issues pending before 0.9 release

2007-03-21 Thread Andrzej Bialecki

Dennis Kubes wrote:

I did an update, clean, and test and got no errors.

BUILD SUCCESSFUL
Total time: 6 minutes


It seems this is related to JDK 1.6 - when I switched back to 1.5 all 
tests passed successfully, switching again to 1.6 causes the parse-swf 
test to fail. I'm not sure what is the reason - it seems that the 
results of text extraction are completely different under 1.6 ...


--
Best regards,
Andrzej Bialecki 



Re: Issues pending before 0.9 release

2007-03-20 Thread Sami Siren
Andrzej Bialecki wrote:
 Hi all,
 
 I just committed Hadoop 0.12.1. Let's double-check that it works ok.
 Here's the list of Critical/Blocker issues I mentioned before, and their
 current status:
 
 Any other stuff we need to fix before the release?

I am satisfied except the broken bin/nutch.

--
 Sami Siren



Re: Issues pending before 0.9 release

2007-03-20 Thread Andrzej Bialecki

Sami Siren wrote:

Andrzej Bialecki wrote:

Hi all,

I just committed Hadoop 0.12.1. Let's double-check that it works ok.
Here's the list of Critical/Blocker issues I mentioned before, and their
current status:

Any other stuff we need to fix before the release?


I am satisfied except the broken bin/nutch.


Fixed now - tested both under Cygwin and Fedora.

--
Best regards,
Andrzej Bialecki 



Re: Issues pending before 0.9 release

2007-03-20 Thread Dennis Kubes

I am good to go as well.

Dennis Kubes

Andrzej Bialecki wrote:

Sami Siren wrote:

Andrzej Bialecki wrote:

Hi all,

I just committed Hadoop 0.12.1. Let's double-check that it works ok.
Here's the list of Critical/Blocker issues I mentioned before, and their
current status:

Any other stuff we need to fix before the release?


I am satisfied except the broken bin/nutch.


Fixed now - tested both under Cygwin and Fedora.



Re: Issues pending before 0.9 release

2007-03-19 Thread Andrzej Bialecki

Hi all,

I just committed Hadoop 0.12.1. Let's double-check that it works ok. 
Here's the list of Critical/Blocker issues I mentioned before, and their 
current status:


NUTCH-400   Fixed.
NUTCH-353   Moved to Major, fix after release.
NUTCH-233   Fixed.
NUTCH-436   Fixed.
NUTCH-427   Moved to Major, fix after release.
NUTCH-381   Won't fix - this is a configuration issue.
NUTCH-277   Cannot reproduce.
NUTCH-167   Fixed.

Any other stuff we need to fix before the release?

--
Best regards,
Andrzej Bialecki 



Re: Issues pending before 0.9 release

2007-03-06 Thread Doug Cutting

Sami Siren wrote:

It would be more beneficial to everybody if the discussions (related to
the release or Nutch) are done in public (hey, this is open source!). The
off-list stuff IMO smells.


+1  Folks sometimes wish to discuss project matters off-list to spare 
others the boring details, but this is usually a bad idea.  All project 
decisions should be made in public on this list.  Discussions relevant 
to these decisions are also thus best made on this list, since they 
explain the decision.  Private discussions are permissible to develop a 
proposal, but that is usually better done on-list when possible, so that 
others can get involved earlier.


(The one notable exception is that personnel issues are discussed on the 
private PMC list.)


Doug


Re: Issues pending before 0.9 release

2007-03-05 Thread Chris Mattmann
Hi Guys,

 Blocker
 
 * NUTCH-400 (Update & add missing license headers) - I believe this is
 fixed and should be closed

+1, thanks to Sami for closing it.

 
 * NUTCH-353 (pages that serverside forwards will be refetched every
 time) - this was partially fixed in NUTCH-273, but a more complete
 solution would require significant changes to LinkDb. As there are no
 patches implementing this, I left it open, but it's no longer as
 critical as it was before. I propose to move it to Major and address
 it in the next release.

+1

 
 * NUTCH-233 (wrong regular expression hang reduce process for ever) - I
 propose to apply the fix provided by Sean Dean and close this issue for now.

+1

 
 Critical
 
 * NUTCH-436 (Incorrect handling of relative paths when the embedded URL
 path is empty). There is no patch available yet. If someone could
 contribute a patch I'd like to see this fixed before the release.

Looks like Dennis is on this one

 
 * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's
 certainly not critical (as this is an optional new feature). I propose
 to change it to Major, and make a decision - do we want another plugin
 like parse-mp3 or parse-rtf, or not.

Let's hold off on this: it's not necessary for 0.9, and I don't think
there's been a bunch of traffic on the list identifying this as critical to
get into the sources for the release

 
 * NUTCH-381 (Ignore external link not work as expected) - I'll try to
 reproduce it, and if I find an easy fix I'd like to apply it before the
 release.

+1

 
 * NUTCH-277 (Fetcher dies because of max. redirects) - I wasn't able
 to reproduce it. If there is no updated information on this I propose to
 close it with Can't reproduce.

+1, I had to do something similar with NUTCH-258

 
 * NUTCH-167 (Observation of META NAME=ROBOTS CONTENT=NOARCHIVE) -
 there's a patch which I tested in a limited production env. If there are
 no objections I'd like to apply it before the release.

+1

 
 Major
 There are 84 major issues, but some of them are either invalid, or
 should be minor, or no longer apply and should be closed. Please
 review them if you can and provide some comments or recommendations if
 you think you have some new information.

I will spend some time going through JIRA today and see if there's any
issues that I can find that:

1. Have a patch already
2. Sound like something quick, easy, and not so far-reaching across the
entire Nutch API

 
 
 One decision also that we need to make is which version of Hadoop should
 be included in the release. Current trunk uses 0.10.1, I have a set of
 production-tested patches that use 0.11.2, and today the Hadoop team
 released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time
 before our release). The most conservative option is to stay with
 0.10.1, but by the time people start using Nutch this will be a fairly
 old version already. I propose to upgrade to 0.11.2. We could use 0.12.1
 - but in this case with the expectation that we release less than stable
 version of Nutch to be soon followed by a minor stable release ...

I'd agree with the upgrade to 0.11.2, +1


Cheers,
  Chris

P.S. I am going to contact Piotr and coordinate with him: I'd like to be the
release manager for this Nutch release.





Re: Issues pending before 0.9 release

2007-03-05 Thread Dennis Kubes



Chris Mattmann wrote:

Hi Guys,


Blocker

* NUTCH-400 (Update & add missing license headers) - I believe this is
fixed and should be closed


+1, thanks to Sami for closing it.


* NUTCH-353 (pages that serverside forwards will be refetched every
time) - this was partially fixed in NUTCH-273, but a more complete
solution would require significant changes to LinkDb. As there are no
patches implementing this, I left it open, but it's no longer as
critical as it was before. I propose to move it to Major and address
it in the next release.


+1


* NUTCH-233 (wrong regular expression hang reduce process for ever) - I
propose to apply the fix provided by Sean Dean and close this issue for now.


+1


Critical

* NUTCH-436 (Incorrect handling of relative paths when the embedded URL
path is empty). There is no patch available yet. If someone could
contribute a patch I'd like to see this fixed before the release.


Looks like Dennis is on this one


* NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's
certainly not critical (as this is an optional new feature). I propose
to change it to Major, and make a decision - do we want another plugin
like parse-mp3 or parse-rtf, or not.


Let's hold off on this: it's not necessary for 0.9, and I don't think
there's been a bunch of traffic on the list identifying this as critical to
get into the sources for the release


* NUTCH-381 (Ignore external link not work as expected) - I'll try to
reproduce it, and if I find an easy fix I'd like to apply it before the
release.


+1


* NUTCH-277 (Fetcher dies because of max. redirects) - I wasn't able
to reproduce it. If there is no updated information on this I propose to
close it with Can't reproduce.


+1, I had to do something similar with NUTCH-258


* NUTCH-167 (Observation of META NAME=ROBOTS CONTENT=NOARCHIVE) -
there's a patch which I tested in a limited production env. If there are
no objections I'd like to apply it before the release.


+1


Major
There are 84 major issues, but some of them are either invalid, or
should be minor, or no longer apply and should be closed. Please
review them if you can and provide some comments or recommendations if
you think you have some new information.


I will spend some time going through JIRA today and see if there's any
issues that I can find that:

1. Have a patch already
2. Sound like something quick, easy, and not so far-reaching across the
entire Nutch API



One decision also that we need to make is which version of Hadoop should
be included in the release. Current trunk uses 0.10.1, I have a set of
production-tested patches that use 0.11.2, and today the Hadoop team
released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time
before our release). The most conservative option is to stay with
0.10.1, but by the time people start using Nutch this will be a fairly
old version already. I propose to upgrade to 0.11.2. We could use 0.12.1
- but in this case with the expectation that we release less than stable
version of Nutch to be soon followed by a minor stable release ...


I'd agree with the upgrade to 0.11.2, +1


Cheers,
  Chris

P.S. I am going to contact Piotr and coordinate with him: I'd like to be the
release manager for this Nutch release.


I would like to help with this as well, even if it is just watching how 
the process works this time.


Dennis






Re: Issues pending before 0.9 release

2007-03-05 Thread Andrzej Bialecki

Chris Mattmann wrote:

P.S. I am going to contact Piotr and coordinate with him: I'd like to be the
release manager for this Nutch release.
  


Everyone heard that? :) That's cool, thanks!

--
Best regards,
Andrzej Bialecki 




Re: Issues pending before 0.9 release

2007-03-04 Thread Sean Dean
As for which Hadoop version is included in the next Nutch release, I share the 
same concern as Sami with 0.10.1 as it NPE's on anything above 100-200k URLs. I 
can volunteer to test any other version we are interested in, my regular 
fetches are about 13 million URLs and take a couple days to complete.
 
If anyone has a specific Hadoop jar they would like to share I don't mind 
testing it, otherwise I can just build the most popular version from source 
and replace that with my current one. For the record, I've been using Hadoop 
0.9.1 for the longest time without any problems on these somewhat large crawls.


----- Original Message -----
From: Sami Siren [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Sunday, March 4, 2007 1:50:23 AM
Subject: Re: Issues pending before 0.9 release


Andrzej Bialecki wrote:
 Hi all,
 
 The following issues need to be discussed and appropriate action taken
 before the 0.9 release:
 
 Blocker
 
 * NUTCH-400 (Update & add missing license headers) - I believe this is
 fixed and should be closed

I agree. I should close it.

 * NUTCH-233 (wrong regular expression hang reduce process for ever) - I
 propose to apply the fix provided by Sean Dean and close this issue for
 now.

yes that was the resolution also last time :)

 * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's
 certainly not critical (as this is an optional new feature). I propose
 to change it to Major, and make a decision - do we want another plugin
 like parse-mp3 or parse-rtf, or not.

One option would be setting up a separate project outside Apache to host
and maintain these and remove the remaining torsos from Nutch source base.

 One decision also that we need to make is which version of Hadoop should
 be included in the release. Current trunk uses 0.10.1, I have a set of
 production-tested patches that use 0.11.2, and today the Hadoop team
 released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time
 before our release). The most conservative option is to stay with
 0.10.1, but by the time people start using Nutch this will be a fairly

0.10.1 is not an option: there is that NPE in sorting that does not
allow any crawling beyond modest sizes (HADOOP-917). We should upgrade
Hadoop to 0.11.2 or 0.12.0 and gather experience from running it on
reasonably sized crawls, so my suggestion is that we don't decide this on
paper.

--
Sami Siren

Re: Issues pending before 0.9 release

2007-03-04 Thread Andrzej Bialecki

Sean Dean wrote:

As for which Hadoop version is included in the next Nutch release, I share the 
same concern as Sami with 0.10.1 as it NPE's on anything above 100-200k URLs. I 
can volunteer to test any other version we are interested in, my regular 
fetches are about 13 million URLs and take a couple days to complete.
 
If anyone has a specific Hadoop jar they would like to share I don't mind testing it, otherwise I can just build the most popular version from source and replace that with my current one. For the record, I've been using Hadoop 0.9.1 for the longest time without any problems on these somewhat large crawls.


  


It's clear to me then that we should bring Nutch to 0.11.2 first anyway. 
Then, if we have time and if you are willing, we could test the 0.12 and 
if it's stable enough for your 13 mln crawl then it's likely it's good 
enough for the rest of us.


If there are no dissenting votes, I'll apply the patch to bring in 
0.11.2 some time tomorrow. I will also create a JIRA issue and attach 
the patches from that revision to Hadoop 0.12 so that folks may test them.


Thanks for your comments!

--
Best regards,
Andrzej Bialecki 




Re: Issues pending before 0.9 release

2007-03-04 Thread Dennis Kubes

NUTCH-436 has a patch now if we want to add that to this release.

Dennis Kubes

Andrzej Bialecki wrote:

Sean Dean wrote:
As for which Hadoop version is included in the next Nutch release, I 
share the same concern as Sami with 0.10.1 as it NPE's on anything 
above 100-200k URLs. I can volunteer to test any other version we are 
interested in, my regular fetches are about 13 million URLs and take a 
couple days to complete.
 
If anyone has a specific Hadoop jar they would like to share I don't 
mind testing it, otherwise I can just build the most popular version 
from source and replace that with my current one. For the record, I've 
been using Hadoop 0.9.1 for the longest time without any problems on 
these somewhat large crawls.


  


It's clear to me then that we should bring Nutch to 0.11.2 first anyway. 
Then, if we have time and if you are willing, we could test the 0.12 and 
if it's stable enough for your 13 mln crawl then it's likely it's good 
enough for the rest of us.


If there are no dissenting votes, I'll apply the patch to bring in 
0.11.2 some time tomorrow. I will also create a JIRA issue and attach 
the patches from that revision to Hadoop 0.12 so that folks may test them.


Thanks for your comments!



Re: Issues pending before 0.9 release

2007-03-03 Thread Sami Siren
Andrzej Bialecki wrote:
 Hi all,
 
 The following issues need to be discussed and appropriate action taken
 before the 0.9 release:
 
 Blocker
 
 * NUTCH-400 (Update & add missing license headers) - I believe this is
 fixed and should be closed

I agree. I should close it.

 * NUTCH-233 (wrong regular expression hang reduce process for ever) - I
 propose to apply the fix provided by Sean Dean and close this issue for
 now.

yes that was the resolution also last time :)

 * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's
 certainly not critical (as this is an optional new feature). I propose
 to change it to Major, and make a decision - do we want another plugin
 like parse-mp3 or parse-rtf, or not.

One option would be setting up a separate project outside Apache to host
and maintain these and remove the remaining torsos from Nutch source base.

 One decision also that we need to make is which version of Hadoop should
 be included in the release. Current trunk uses 0.10.1, I have a set of
 production-tested patches that use 0.11.2, and today the Hadoop team
 released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time
 before our release). The most conservative option is to stay with
 0.10.1, but by the time people start using Nutch this will be a fairly

0.10.1 is not an option: there is that NPE in sorting that does not
allow any crawling beyond modest sizes (HADOOP-917). We should upgrade
Hadoop to 0.11.2 or 0.12.0 and gather experience from running it on
reasonably sized crawls, so my suggestion is that we don't decide this on
paper.

--
 Sami Siren


Issues pending before 0.9 release

2007-03-02 Thread Andrzej Bialecki

Hi all,

The following issues need to be discussed and appropriate action taken 
before the 0.9 release:


Blocker

* NUTCH-400 (Update & add missing license headers) - I believe this is
fixed and should be closed


* NUTCH-353 (pages that serverside forwards will be refetched every 
time) - this was partially fixed in NUTCH-273, but a more complete 
solution would require significant changes to LinkDb. As there are no 
patches implementing this, I left it open, but it's no longer as 
critical as it was before. I propose to move it to Major and address 
it in the next release.


* NUTCH-233 (wrong regular expression hang reduce process for ever) - I 
propose to apply the fix provided by Sean Dean and close this issue for now.
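
As general background on the last item: a regular expression that backtracks heavily can keep a task busy for a very long time. One way to bound that, sketched below in Java, is to match against a CharSequence wrapper that gives up once a deadline has passed. This is not the fix attached to NUTCH-233; the class names, the exception type, and the sample pattern are all assumptions made for illustration.

```java
import java.util.regex.Pattern;

final class BoundedRegex {

    /** Wraps a CharSequence and refuses character access after a deadline. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads its input through charAt, so this check
            // runs repeatedly while the engine is backtracking.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new IllegalStateException("regex match exceeded time limit");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Runs find() on the input, throwing IllegalStateException once the time limit passes. */
    static boolean findWithin(Pattern pattern, String input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline)).find();
    }

    public static void main(String[] args) {
        // Hypothetical pattern: a path segment repeated twice in a row.
        Pattern repeatedSegment = Pattern.compile("/(\\w+)/\\1/");
        System.out.println(findWithin(repeatedSegment, "http://example.com/a/a/b", 1000L));
    }
}
```

A guard like this only limits the damage. Rewriting the expression so that it cannot backtrack excessively remains the more direct remedy.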


Critical

* NUTCH-436 (Incorrect handling of relative paths when the embedded URL 
path is empty). There is no patch available yet. If someone could 
contribute a patch I'd like to see this fixed before the release.


* NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's 
certainly not critical (as this is an optional new feature). I propose 
to change it to Major, and make a decision - do we want another plugin 
like parse-mp3 or parse-rtf, or not.


* NUTCH-381 (Ignore external link not work as expected) - I'll try to 
reproduce it, and if I find an easy fix I'd like to apply it before the 
release.


* NUTCH-277 (Fetcher dies because of max. redirects) - I wasn't able 
to reproduce it. If there is no updated information on this I propose to 
close it with Can't reproduce.


* NUTCH-167 (Observation of META NAME=ROBOTS CONTENT=NOARCHIVE) - 
there's a patch which I tested in a limited production env. If there are 
no objections I'd like to apply it before the release.
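
For readers unfamiliar with the directive in the last item: a page can carry a robots META tag whose CONTENT value is a comma-separated list of directives, and NOARCHIVE asks that no cached copy of the page be offered. A minimal Java sketch of detecting it follows. It is not the NUTCH-167 patch, and the class and method names are hypothetical.

```java
import java.util.Locale;

final class RobotsMetaSketch {

    /** Returns true if a robots META content value contains the NOARCHIVE directive. */
    static boolean forbidsArchiving(String content) {
        if (content == null) {
            return false;
        }
        // Directives are comma-separated and matched case-insensitively.
        for (String directive : content.split(",")) {
            if (directive.trim().toLowerCase(Locale.ROOT).equals("noarchive")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(forbidsArchiving("NOINDEX, NOARCHIVE"));
        System.out.println(forbidsArchiving("index, follow"));
    }
}
```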


Major
There are 84 major issues, but some of them are either invalid, or 
should be minor, or no longer apply and should be closed. Please 
review them if you can and provide some comments or recommendations if 
you think you have some new information.



One decision also that we need to make is which version of Hadoop should 
be included in the release. Current trunk uses 0.10.1, I have a set of 
production-tested patches that use 0.11.2, and today the Hadoop team 
released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time 
before our release). The most conservative option is to stay with 
0.10.1, but by the time people start using Nutch this will be a fairly 
old version already. I propose to upgrade to 0.11.2. We could use 0.12.1 
- but in this case with the expectation that we release less than stable 
version of Nutch to be soon followed by a minor stable release ...


--
Best regards,
Andrzej Bialecki 