Re: Nutch download speed

Doğacan Güney Thu, 16 Jul 2009 06:41:25 -0700

On Thu, Jul 16, 2009 at 16:11, Hrishikesh Agashe <[email protected]> wrote:
> Hi,
>
> We are running Nutch 1.0 on a 4-VM Hadoop cluster (each VM: 2 CPUs, quad
> core, 3 GB RAM) located in a data center, with data storage on a common NAS.
> The bandwidth available to us is 150 Mb/sec. A theoretical calculation tells
> me that I can download 1.62 TB of data per day:
> (150 * 60 * 60 * 24) / 8 = 1620000 MB.
>
> Now my aim is to tune Nutch to get as close as possible to this figure.
> I experimented a lot with different Nutch params (number of maps=17, number
> of reduces=7, number of threads=800, fairness=3 sec, etc.), but the most I
> could get is 3 GB/hour, which is 72 GB per day, far less than 1.62 TB. I have
> set filters to download only HTML text: no images, no videos, etc.
>
> So I wanted to know which params contribute to the speed of Nutch data
> download.
> Am I missing something very obvious? Are there too few machines? Is the
> hardware configuration not powerful enough?
>


What are you downloading? Remember that Nutch waits between successive
requests to the same host, so you may simply be running out of hosts to
fetch (in which case the fetcher just waits).
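
To get a rough feel for how much that per-host wait can cap throughput,
here is a back-of-the-envelope sketch. It is not a model of the fetcher
itself: it ignores response latency, DNS, threads per host and parsing.
The 3-second delay and the 150 Mb/sec link come from your numbers; the
20 KB average page size and the function names are assumptions made up
for illustration, so plug in your own values.

```python
import math

SECONDS_PER_DAY = 60 * 60 * 24


def link_ceiling_mb_per_day(link_mbit_per_s: float) -> float:
    """Megabytes per day a fully saturated link could carry (8 bits per byte)."""
    return link_mbit_per_s * SECONDS_PER_DAY / 8


def per_host_mb_per_day(delay_s: float, avg_page_kb: float) -> float:
    """Upper bound for one host fetched at most once every delay_s seconds."""
    pages_per_day = SECONDS_PER_DAY / delay_s
    return pages_per_day * avg_page_kb / 1000


def hosts_needed(target_mb_per_day: float, delay_s: float, avg_page_kb: float) -> int:
    """Distinct hosts that must be fetchable at all times to reach the target."""
    return math.ceil(target_mb_per_day / per_host_mb_per_day(delay_s, avg_page_kb))


if __name__ == "__main__":
    delay_s = 3.0        # the "fairness" delay from the question
    avg_page_kb = 20.0   # ASSUMPTION: illustrative average HTML page size

    ceiling = link_ceiling_mb_per_day(150)
    print("link ceiling (MB/day):", ceiling)
    print("hosts needed to fill the link:", hosts_needed(ceiling, delay_s, avg_page_kb))
    print("hosts matching 72 GB/day:", hosts_needed(72_000, delay_s, avg_page_kb))
```

Under those assumptions each host contributes at most 576 MB per day, so
filling the link would need a few thousand hosts with URLs ready to fetch
at every moment, while 72 GB per day corresponds to only about 125 such
hosts. If your fetch lists are dominated by a small number of hosts, the
delay rather than the hardware is the likely limit.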

However, several people have suggested that the Fetcher in Nutch 1.0 _is_ slower:

https://issues.apache.org/jira/browse/NUTCH-721

My recommendation would be to try the OldFetcher class in trunk to see if
it makes a difference.

> TIA,
> -Hrishi
>



-- 
Doğacan Güney
