Re: The Future of Nutch, reactivated

2009-05-19 Thread Andrzej Bialecki
Aaron Binns wrote: Our usage of Nutch is focused on index building and search services. We don't use the crawling/fetching features at all. We use Heritrix. Typically, our large-scale harvests are performed over 8-12 week periods, then the archived data is handed off to me for full-text

Re: The Future of Nutch, reactivated

2009-05-19 Thread Aaron Binns
Andrzej Bialecki a...@getopt.org writes: One of the biggest boons of Nutch is the Hadoop infrastructure. When indexing massive data sets, being able to fire up 60+ nodes in a Hadoop system helps tremendously. Are you familiar with the distributed indexing package in Hadoop contrib/?

Re: The Future of Nutch, reactivated

2009-05-19 Thread Mark Olson
AA{hb - Original Message - From: Aaron Binns aa...@archive.org To: nutch-dev@lucene.apache.org nutch-dev@lucene.apache.org Sent: Tue May 19 13:23:37 2009 Subject: Re: The Future of Nutch, reactivated Andrzej Bialecki a...@getopt.org writes: One of the biggest boons of Nutch is the

Re: The Future of Nutch, reactivated

2009-05-19 Thread Mark Olson
R - Original Message - From: Aaron Binns aa...@archive.org To: nutch-dev@lucene.apache.org nutch-dev@lucene.apache.org Sent: Tue May 19 13:23:37 2009 Subject: Re: The Future of Nutch, reactivated Andrzej Bialecki a...@getopt.org writes: One of the biggest boons of Nutch is the Hadoop

Re: The Future of Nutch, reactivated

2009-05-19 Thread Bradford Stephens
I would like to point out that Nutch is going to be essential to our company's infrastructure; we're definitely case #1. We'll probably have it running on 100 boxes in a few weeks. On Tue, May 19, 2009 at 2:26 PM, Mark Olson mark.ol...@quantum.com wrote: R - Original Message -

Performance issues with queue-based fetching

2009-05-19 Thread Ken Krugler
Hi all, I just posted some performance figures from a test crawl I did using an alternative queue-based fetcher (Bixo) at: http://ken-blog.krugler.org/2009/05/19/performance-problems-with-verticalfocused-web-crawling/ From this data, and my experience using Nutch for vertical crawls