As for which Hadoop version is included in the next Nutch release, I share the
same concern as Sami about 0.10.1, as it NPEs on anything above 100-200k URLs. I
can volunteer to test any other version we are interested in; my regular
fetches are about 13 million URLs and take a couple of days to complete.
If anyone has a specific Hadoop jar they would like to share, I don't mind
testing it; otherwise I can just build the "most popular" version from source
and replace my current one with it. For the record, I've been using Hadoop
0.9.1 for the longest time without any problems on these somewhat large crawls.
----- Original Message ----
From: Sami Siren <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, March 4, 2007 1:50:23 AM
Subject: Re: Issues pending before 0.9 release
Andrzej Bialecki wrote:
> Hi all,
>
> The following issues need to be discussed and appropriate action taken
> before the 0.9 release:
>
> Blocker
> ========
> * NUTCH-400 (Update & add missing license headers) - I believe this is
> fixed and should be closed
I agree. I should close it.
> * NUTCH-233 (wrong regular expression hang reduce process for ever) - I
> propose to apply the fix provided by Sean Dean and close this issue for
> now.
Yes, that was also the resolution last time :)
> * NUTCH-427 (protocol-smb). This relies on a LGPL library, and it's
> certainly not critical (as this is an optional new feature). I propose
> to change it to Major, and make a decision - do we want another plugin
> like parse-mp3 or parse-rtf, or not.
One option would be to set up a separate project outside Apache to host
and maintain these, and to remove the remaining fragments from the Nutch source base.
> One decision also that we need to make is which version of Hadoop should
> be included in the release. Current trunk uses 0.10.1, I have a set of
> production-tested patches that use 0.11.2, and today the Hadoop team
> released 0.12.0 (to be followed shortly by a 0.12.1, most likely in time
> before our release). The most conservative option is to stay with
> 0.10.1, but by the time people start using Nutch this will be a fairly old version.
0.10.1 is not an option: there is that NPE in sorting that does not
allow any crawling beyond modest sizes (HADOOP-917). We should upgrade
Hadoop to 0.11.2 or 0.12.0 and gather experience from running it on
reasonably sized crawls, so my suggestion is that we don't decide this on
paper.
--
Sami Siren