What is the difference between mapred.map.tasks and mapred.tasktracker.map.tasks.maximum

2009-07-02 Thread Pravin Karne
Hi,
I am using Nutch on a 10-node cluster.
I want to configure nutch-site.xml.
What is the difference between mapred.map.tasks and
mapred.tasktracker.map.tasks.maximum, or between
mapred.reduce.tasks and mapred.tasktracker.reduce.tasks.maximum?

Thanks
-Pravin
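
In short: mapred.map.tasks is a per-job setting, a hint for how many map tasks the job
is split into (ignored when mapred.job.tracker is local), while
mapred.tasktracker.map.tasks.maximum is set on each node and caps how many map tasks a
single tasktracker runs at the same time. The reduce-side pair splits the same way. As a
rough sketch with purely illustrative values for a 10-node cluster, the two might look
like this in nutch-site.xml:

<property>
  <name>mapred.map.tasks</name>
  <value>53</value>
  <description>Per-job: how many map tasks the job is split into.
  Illustrative value: a prime several times the number of hosts.
  </description>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
  <description>Per-node: how many map tasks one tasktracker may run
  simultaneously. Illustrative value.
  </description>
</property>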

From: Pravin Karne
Sent: Thursday, July 02, 2009 12:16 PM
To: 'nutch-dev@lucene.apache.org'
Subject: Nutch is very slow; what does the following graph show?

Hi,
I have a 10-node Nutch cluster.

I have the following report, and the cluster has very low (slow) performance. (I am not
using indexing; I am using Nutch only as a web crawler.)
What does the following report show?
Even though I have a 10-node cluster, it shows only 3 running tasks at a time.
Is this expected behavior, or do I have to configure Nutch in a more optimized way? If so,
how do I do that?
Thanks

Pravin
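
A rough sanity check, assuming stock tasktracker settings: each node runs at most
mapred.tasktracker.map.tasks.maximum map tasks at once (commonly 2 by default), so a
10-node cluster has on the order of 10 x 2 = 20 map slots, and those slots only fill up
if the job itself is split into at least that many tasks via mapred.map.tasks. Seeing
only 3 running tasks on 10 nodes therefore usually means either the per-node maximum or
the per-job task counts are set low; the replies below discuss how these values are
typically sized.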




mapred.map.tasks

2006-04-20 Thread Anton Potehin
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

 

We have a question about this property. Is it really preferable to set this
parameter to several times the number of available hosts? We do not
understand why it should be.

Our spider is distributed among 3 machines. What value is best for this
parameter in our case? Which other factors may affect the best value for
this parameter?

 



Re: mapred.map.tasks

2006-04-20 Thread Doug Cutting

Anton Potehin wrote:

We have a question about this property. Is it really preferable to set this
parameter to several times the number of available hosts? We do not
understand why it should be.


It should be at least numHosts*mapred.tasktracker.tasks.maximum, so that 
all of the task slots are used.  Using more tasks also makes recovery faster 
when a task fails, since less work needs to be redone.



Our spider is distributed among 3 machines. What value is best for this
parameter in our case? Which other factors may affect the best value for
this parameter?


When fetching, the total number of hosts you're fetching can also be a 
factor, since fetch tasks are hostwise-disjoint.  If you're only 
fetching a few hosts, then a large value for mapred.map.tasks will cause 
there to be a few big fetch tasks and a bunch of empty ones.  This could 
be a problem if the big ones are not allocated evenly among your nodes.


I generally use 5*numHosts*mapred.tasktracker.tasks.maximum.

Doug
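
Applied to the 3-machine spider above, and assuming mapred.tasktracker.tasks.maximum
is 2 (as in the configuration posted below), the lower bound is 3 * 2 = 6 map tasks
and the rule of thumb gives 5 * 3 * 2 = 30; rounded to a nearby prime, as the property
description suggests, a sketch of the setting might be:

<property>
  <name>mapred.map.tasks</name>
  <value>31</value>
  <description>Illustrative only: about 5 * numHosts *
  mapred.tasktracker.tasks.maximum for 3 hosts with 2 task slots each,
  rounded to a nearby prime.
  </description>
</property>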


Re: mapred.map.tasks

2005-11-21 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

Why do we need the parameter mapred.map.tasks to be greater than the number of
available hosts? If we set it equal to the number of hosts, we get the negative
progress percentages problem.


Can you please post a simple example that demonstrates the negative 
progress problem?  E.g., the minimal changes to your conf/ directory 
required to illustrate this, how you start your daemons, etc.


Thanks,

Doug


RE: mapred.map.tasks

2005-11-21 Thread anton
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111.

In nutch-site.xml I specified the following parameters:

1) On both machines:
<property>
  <name>fs.default.name</name>
  <value>192.168.0.250:9009</value>
  <description>The name of the default file system.  Either the
  literal string local or a host:port for NDFS.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>192.168.0.250:9010</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If local, then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is local.
  </description>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is local.
  </description>
</property>
 



On 192.168.0.250 I started:
2)   bin/nutch-daemon.sh start datanode
3)   bin/nutch-daemon.sh start namenode
4)   bin/nutch-daemon.sh start jobtracker
5)   bin/nutch-daemon.sh start tasktracker

I created a directory named seeds with a file named urls in it; urls contained 2 links.
Then I added that directory to NDFS (bin/nutch ndfs -put ./seeds seeds).
The directory was added successfully.

 

Then I launched the command:
bin/nutch crawl seeds -depth 2

As a result, I received the following log written by the jobtracker:

051123 053118 Adding task 'task_m_z66npx' to set for tracker 'tracker_53845'
051123 053118 Adding task 'task_m_xaynqo' to set for tracker 'tracker_11518'
051123 053130 Task 'task_m_z66npx' has finished successfully.
 

Log written by tasktracker on 192.168.0.111:
..
051110 142607 task_m_z66npx 0.0% /user/root/seeds/urls:0+31
051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31
051110 142607 Task task_m_z66npx is done.
 

Log written by tasktracker on 192.168.0.250:

051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31
051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31
051123 053127 task_m_xaynqo -2129.9678% /user/root/seeds/urls:31+31
051123 053128 task_m_xaynqo -3483.0322% /user/root/seeds/urls:31+31
051123 053129 task_m_xaynqo -4976.2256% /user/root/seeds/urls:31+31
051123 053130 task_m_xaynqo -6449.1934% /user/root/seeds/urls:31+31
051123 053131 task_m_xaynqo -7898.258% /user/root/seeds/urls:31+31
051123 053132 task_m_xaynqo -9232.193% /user/root/seeds/urls:31+31
051123 053133 task_m_xaynqo -10694.3545% /user/root/seeds/urls:31+31
051123 053134 task_m_xaynqo -12139.226% /user/root/seeds/urls:31+31
051123 053135 task_m_xaynqo -13416.677% /user/root/seeds/urls:31+31
051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
... and so on; the log continued with records showing decreasing (negative) percentages.

 

I concluded that there was an attempt to split the inject step across the 2 machines,
i.e. there were 2 tasks: 'task_m_z66npx' and 'task_m_xaynqo'. 'task_m_z66npx' finished
successfully, while 'task_m_xaynqo' caused problems (negative progress).

But if I change the parameter mapred.reduce.tasks to 4, all tasks finish
successfully and everything works correctly.
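
For reference, that single change would look like this in nutch-site.xml (a sketch of
only the property that was modified):

<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
  <description>The default number of reduce tasks per job; raised here
  from 2 to 4, after which all tasks completed successfully.
  </description>
</property>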


