Aaron Binns wrote:
Our usage of Nutch is focused on index building and search services. We
don't use the crawling/fetching features at all. We use Heritrix.
Typically, our large-scale harvests are performed over 8-12 week
periods, then the archived data is handed off to me for full-text indexing.
Andrzej Bialecki a...@getopt.org writes:
One of the biggest boons of Nutch is the Hadoop infrastructure. When
indexing massive data sets, being able to fire up 60+ nodes in a
Hadoop system helps tremendously.
Are you familiar with the distributed indexing package in Hadoop contrib/?
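[Editor's note: for readers unfamiliar with the approach discussed above, the core idea of distributed index building is to partition documents across workers so that each builds an independent index shard, which can then be searched in parallel and merged. The sketch below is a minimal, self-contained illustration of that partition/build/merge pattern in plain Python; the names `shard_for`, `build_shards`, and `search` are hypothetical stand-ins, not the Hadoop contrib/ API.]

```python
from collections import defaultdict

def shard_for(doc_id, num_shards):
    # Partition step: route each document to exactly one shard,
    # analogous to a Hadoop partitioner assigning keys to reducers.
    return hash(doc_id) % num_shards

def build_shards(docs, num_shards):
    # "Reduce" step: each shard builds its own inverted index
    # independently, so shards could be built on separate nodes.
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        index = shards[shard_for(doc_id, num_shards)]
        for term in text.lower().split():
            index[term].add(doc_id)
    return shards

def search(shards, term):
    # Query step: fan the query out to every shard and merge the hits.
    hits = set()
    for index in shards:
        hits |= index.get(term.lower(), set())
    return hits
```

In a real deployment the in-memory sets would be Lucene segments written by each reducer, but the data flow (partition, build per shard, merge at query time) is the same.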
- Original Message -
From: Aaron Binns aa...@archive.org
To: nutch-dev@lucene.apache.org nutch-dev@lucene.apache.org
Sent: Tue May 19 13:23:37 2009
Subject: Re: The Future of Nutch, reactivated
I would like to point out that Nutch is going to be essential to our
company's infrastructure; we're definitely case #1. We'll probably have it
running on 100 boxes in a few weeks.
On Tue, May 19, 2009 at 2:26 PM, Mark Olson mark.ol...@quantum.com wrote:
- Original Message -
Hi all,
I just posted some performance figures from a test crawl I did using
an alternative queue-based fetcher (Bixo) at:
http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/
From this data, and my experience using Nutch for vertical crawls