What is the difference between mapred.map.tasks and mapred.tasktracker.map.tasks.maximum?
Hi,

I am using Nutch with a 10-node cluster and want to configure nutch-site.xml. What is the difference between mapred.map.tasks and mapred.tasktracker.map.tasks.maximum? Likewise, what is the difference between mapred.reduce.tasks and mapred.tasktracker.reduce.tasks.maximum?

Thanks,
-Pravin

From: Pravin Karne
Sent: Thursday, July 02, 2009 12:16 PM
To: 'nutch-dev@lucene.apache.org'
Subject: Nutch is very slow; what does the following graph show?

Hi,

I have a 10-node Nutch cluster and the report below. The cluster has very low (slow) performance. (I am not using indexing; I use Nutch only as a web crawler.) What does the report show? Even though I have a 10-node cluster, it shows only 3 running tasks at a time. Is this expected behavior, or do I have to configure Nutch in an optimized way? If so, how do I do that?

Thanks,
Pravin
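For context, the two families of properties operate at different levels: mapred.map.tasks and mapred.reduce.tasks are job-level defaults for how many tasks a job is split into across the whole cluster, while mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum cap how many tasks a single TaskTracker node runs concurrently. A minimal nutch-site.xml sketch for a 10-node cluster might look like the following; the values are illustrative, not recommendations:

```xml
<!-- Illustrative sketch for a 10-node cluster; tune for your hardware. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value> <!-- per node: at most 2 map tasks run concurrently on each TaskTracker -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value> <!-- per node: at most 2 reduce tasks run concurrently -->
</property>
<property>
  <name>mapred.map.tasks</name>
  <value>53</value> <!-- per job: default map task count, a prime several times the 10 nodes -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>11</value> <!-- per job: a prime close to the number of nodes -->
</property>
```

With these values the cluster can run at most 10 * 2 = 20 map tasks at once, and a job split into 53 map tasks keeps all slots busy in several waves.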
mapred.map.tasks
  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>The default number of map tasks per job. Typically set to a prime several times greater than the number of available hosts. Ignored when mapred.job.tracker is "local".</description>
  </property>

We have a question about this property. Is it really preferable to set this parameter several times greater than the number of available hosts? We do not understand why that should be so. Our spider is distributed among 3 machines. What value is preferred for this parameter in our case? Which other factors may affect the preferred value of this parameter?
Re: mapred.map.tasks
Anton Potehin wrote:
> We have a question about this property. Is it really preferable to set this parameter several times greater than the number of available hosts? We do not understand why it should be so.

It should be at least numHosts * mapred.tasktracker.tasks.maximum, so that all of the task slots are used. More tasks also make recovery faster when a task fails, since less work needs to be redone.

> Our spider is distributed among 3 machines. What value is preferred for this parameter in our case? Which other factors may affect the preferred value of this parameter?

When fetching, the total number of hosts you are fetching from can also be a factor, since fetch tasks are hostwise-disjoint. If you are fetching only a few hosts, a large value for mapred.map.tasks will produce a few big fetch tasks and a bunch of empty ones. This can be a problem if the big ones are not allocated evenly among your nodes. I generally use 5 * numHosts * mapred.tasktracker.tasks.maximum.

Doug
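Doug's sizing rule above can be written out as simple arithmetic. The sketch below is illustrative (the function names are not part of any Hadoop or Nutch API):

```python
def minimum_map_tasks(num_hosts, tasks_maximum):
    """Lower bound: enough tasks to fill every task slot in the cluster once."""
    return num_hosts * tasks_maximum

def recommended_map_tasks(num_hosts, tasks_maximum, factor=5):
    """Doug's rule of thumb: factor * numHosts * mapred.tasktracker.tasks.maximum."""
    return factor * num_hosts * tasks_maximum

# Example: the 3-machine spider from the question, with 2 tasks per tracker.
print(minimum_map_tasks(3, 2))      # 6
print(recommended_map_tasks(3, 2))  # 30
```

Setting mapred.map.tasks well above the minimum means each task is small, so a failed or straggling task costs little to redo, at the price of some per-task scheduling overhead.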
Re: mapred.map.tasks
[EMAIL PROTECTED] wrote:
> Why do we need the parameter mapred.map.tasks to be greater than the number of available hosts? If we set it equal to the number of hosts, we get the negative progress percentages problem.

Can you please post a simple example that demonstrates the negative progress problem? E.g., the minimal changes to your conf/ directory required to illustrate it, how you start your daemons, etc.

Thanks,
Doug
RE: mapred.map.tasks
I tried to launch mapred on 2 machines: 192.168.0.250 and 192.168.0.111. In nutch-site.xml I specified these parameters:

1) On both machines:

  <property>
    <name>fs.default.name</name>
    <value>192.168.0.250:9009</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.0.250:9010</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>

  <property>
    <name>mapred.map.tasks</name>
    <value>2</value>
    <description>The default number of map tasks per job. Typically set to a prime several times greater than the number of available hosts. Ignored when mapred.job.tracker is "local".</description>
  </property>

  <property>
    <name>mapred.tasktracker.tasks.maximum</name>
    <value>2</value>
    <description>The maximum number of tasks that will be run simultaneously by a task tracker.</description>
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
    <description>The default number of reduce tasks per job. Typically set to a prime close to the number of available hosts. Ignored when mapred.job.tracker is "local".</description>
  </property>

On 192.168.0.250 I started:

2) bin/nutch-daemon.sh start datanode
3) bin/nutch-daemon.sh start namenode
4) bin/nutch-daemon.sh start jobtracker
5) bin/nutch-daemon.sh start tasktracker

I created a directory "seeds" with a file "urls" in it; urls contained 2 links. Then I added that directory to NDFS (bin/nutch ndfs -put ./seeds seeds). The directory was added successfully. Then I ran:

  bin/nutch crawl seeds -depth 2

As a result I received this log, written by the jobtracker:

  051123 053118 Adding task 'task_m_z66npx' to set for tracker 'tracker_53845'
  051123 053118 Adding task 'task_m_xaynqo' to set for tracker 'tracker_11518'
  051123 053130 Task 'task_m_z66npx' has finished successfully.

Log written by the tasktracker on 192.168.0.111:
  051110 142607 task_m_z66npx 0.0% /user/root/seeds/urls:0+31
  051110 142607 task_m_z66npx 1.0% /user/root/seeds/urls:0+31
  051110 142607 Task task_m_z66npx is done.

Log written by the tasktracker on 192.168.0.250:

  051123 053125 task_m_xaynqo 0.12903225% /user/root/seeds/urls:31+31
  051123 053126 task_m_xaynqo -683.9677% /user/root/seeds/urls:31+31
  051123 053127 task_m_xaynqo -2129.9678% /user/root/seeds/urls:31+31
  051123 053128 task_m_xaynqo -3483.0322% /user/root/seeds/urls:31+31
  051123 053129 task_m_xaynqo -4976.2256% /user/root/seeds/urls:31+31
  051123 053130 task_m_xaynqo -6449.1934% /user/root/seeds/urls:31+31
  051123 053131 task_m_xaynqo -7898.258% /user/root/seeds/urls:31+31
  051123 053132 task_m_xaynqo -9232.193% /user/root/seeds/urls:31+31
  051123 053133 task_m_xaynqo -10694.3545% /user/root/seeds/urls:31+31
  051123 053134 task_m_xaynqo -12139.226% /user/root/seeds/urls:31+31
  051123 053135 task_m_xaynqo -13416.677% /user/root/seeds/urls:31+31
  051123 053136 task_m_xaynqo -14885.741% /user/root/seeds/urls:31+31
  ... and so on ...

That is, this log contained records with decreasing (negative) progress percentages. I concluded that there was an attempt to split the inject step across the 2 machines, i.e. there were 2 tasks: 'task_m_z66npx' and 'task_m_xaynqo'. 'task_m_z66npx' finished successfully, while 'task_m_xaynqo' ran into problems (negative progress). But if I change the parameter mapred.reduce.tasks to 4, all tasks finish successfully and everything works correctly.