7, each node has a datanode and a tasktracker running on it. I attach the full file here:
2014.03.07|10:13:17~/HadoopSetupTest/hadoop-1.2.1/conf>cat mapred-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

<property>
  <name>mapred.job.tracker</name>
  <value>compute-1-23:50331</value>
  <description>The host and port at which the MapReduce job tracker runs.
  If "local", then jobs are run in-process as a single map and reduce task.
  </description>
</property>

<property>
  <name>mapred.job.tracker.http.address</name>
  <value>0.0.0.0:50332</value>
  <description>The address and port at which the MapReduce job tracker
  web UI runs.
  </description>
</property>

<property>
  <name>mapred.task.tracker.http.address</name>
  <value>0.0.0.0:50333</value>
  <description>The address and port at which the MapReduce task tracker
  web UI runs.
  </description>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
  <description>The pool name property can be specified in the jobconf.</description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>${hadoop.tmp.dir}/mapred/local</value>
  <description>The local directory where MapReduce stores intermediate
  data files. May be a comma-separated list of directories on different
  devices in order to spread disk I/O. Directories that do not exist are
  ignored.
  </description>
</property>

<property>
  <name>mapred.system.dir</name>
  <value>${hadoop.tmp.dir}/system/mapred</value>
  <description>The shared directory where MapReduce stores control files.
  </description>
</property>

<property>
  <name>mapred.tasktracker.dns.interface</name>
  <value>default</value>
  <description>The name of the network interface from which a task
  tracker should report its IP address (e.g. eth0).
  </description>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx3600m -XX:+UseParallelGC -mx1024m -XX:MaxHeapFreeRatio=10 -XX:MinHeapFreeRatio=10</value>
  <description>Java opts for the task tracker child processes.
  The following symbol, if present, will be interpolated: @taskid@ is
  replaced by the current TaskID. Any other occurrences of '@' will go
  unchanged. For example, to enable verbose gc logging to a file named
  for the taskid in /tmp and to set the heap maximum to be a gigabyte,
  pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
  The configuration variable mapred.child.ulimit can be used to control
  the maximum virtual memory of the child processes.
  </description>
</property>

<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
  <description>How many tasks to run per JVM. If set to -1, there is no
  limit.</description>
</property>

<property>
  <name>mapred.job.tracker.handler.count</name>
  <value>40</value>
  <description>The number of server threads for the JobTracker. This
  should be roughly 4% of the number of tasktracker nodes.</description>
</property>

<property>
  <name>mapred.jobtracker.maxtasks.per.job</name>
  <value>-1</value>
  <description>The maximum number of tasks for a single job. A value of
  -1 indicates that there is no maximum.</description>
</property>

<property>
  <name>mapred.tasktracker.expiry.interval</name>
  <value>600000</value>
  <description>The time to wait for a progress report from a task
  tracker before the jobtracker declares it lost. Default is
  1000*60*10, i.e. 10 minutes.</description>
</property>

<property>
  <name>mapred.task.timeout</name>
  <value>0</value>
  <description>The number of milliseconds before a task is terminated if
  it neither reads an input, writes an output, nor updates its status.
  A value of 0 disables the timeout.</description>
</property>

<property>
  <name>mapred.map.tasks.speculative.execution</name>
  <value>false</value>
  <description>Sets speculative execution for map tasks.</description>
</property>

<property>
  <name>mapred.reduce.tasks.speculative.execution</name>
  <value>false</value>
  <description>Sets speculative execution for reduce tasks.</description>
</property>

<property>
  <name>mapred.hosts.exclude</name>
  <value>conf/excludes</value>
</property>

<property>
  <name>mapred.job.tracker.handler.count</name>
  <value>40</value>
</property>

</configuration>

2014-03-07 9:59 GMT-06:00 Claudio Martella <claudio.marte...@gmail.com>:

> that depends on your cluster configuration. what is the maximum number of
> mappers you can have concurrently on each node?
>
>
> On Fri, Mar 7, 2014 at 4:42 PM, Suijian Zhou <suijian.z...@gmail.com> wrote:
>
>> The current setting is:
>> <name>mapred.child.java.opts</name>
>> <value>-Xmx6144m -XX:+UseParallelGC -mx1024m -XX:MaxHeapFreeRatio=10
>> -XX:MinHeapFreeRatio=10</value>
>>
>> Is 6144MB enough (for each task tracker)? I.e., I have 39 nodes to
>> process the 8*2GB input files.
>>
>> Best Regards,
>> Suijian
>>
>>
>> 2014-03-07 9:21 GMT-06:00 Claudio Martella <claudio.marte...@gmail.com>:
>>
>>> this setting won't be used by Giraph (or by any mapreduce application),
>>> but by the hadoop infrastructure itself.
>>> you should use mapred.child.java.opts instead.
>>>
>>>
>>> On Fri, Mar 7, 2014 at 4:19 PM, Suijian Zhou <suijian.z...@gmail.com> wrote:
>>>
>>>> Hi, Claudio,
>>>> I had set the following when I ran the program:
>>>> export HADOOP_DATANODE_OPTS="-Xmx10g"
>>>> and
>>>> export HADOOP_HEAPSIZE=30000
>>>>
>>>> in hadoop-env.sh and restarted hadoop.
>>>>
>>>> Best Regards,
>>>> Suijian
>>>>
>>>>
>>>> 2014-03-06 17:29 GMT-06:00 Claudio Martella <claudio.marte...@gmail.com>:
>>>>
>>>>> did you actually increase the heap?
>>>>>
>>>>>
>>>>> On Thu, Mar 6, 2014 at 11:43 PM, Suijian Zhou <suijian.z...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I tried to process only 2 of the input files, i.e., 2GB + 2GB input,
>>>>>> and the program finished successfully in 6 minutes. But as I have 39
>>>>>> nodes, they should be enough to load and process the 8*2GB=16GB graph?
>>>>>> Can somebody give some hints? (Will all the nodes participate in
>>>>>> loading the graph from HDFS, or does only the master node load it?)
>>>>>> Thanks!
>>>>>>
>>>>>> Best Regards,
>>>>>> Suijian
>>>>>>
>>>>>>
>>>>>> 2014-03-06 16:24 GMT-06:00 Suijian Zhou <suijian.z...@gmail.com>:
>>>>>>
>>>>>>> Hi, Experts,
>>>>>>> I'm trying to process a graph with PageRank in Giraph, but the
>>>>>>> program always gets stuck.
>>>>>>> There are 8 input files, each of size ~2GB, all copied onto HDFS. I
>>>>>>> use 39 nodes, and each node has 16GB of memory and 8 cores. It keeps
>>>>>>> printing the same info (as follows) on the screen after 2 hours, so it
>>>>>>> looks like no progress at all. What are the possible reasons? Small
>>>>>>> test files run without problems. Thanks!
>>>>>>>
>>>>>>> 14/03/06 16:17:42 INFO job.JobProgressTracker: Data from 39 workers
>>>>>>> - Compute superstep 0: 5854829 out of 49200000 vertices computed;
>>>>>>> 181 out of 1521 partitions computed
>>>>>>> 14/03/06 16:17:47 INFO job.JobProgressTracker: Data from 39 workers
>>>>>>> - Compute superstep 0: 5854829 out of 49200000 vertices computed;
>>>>>>> 181 out of 1521 partitions computed
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Suijian
>>>>>
>>>>> --
>>>>> Claudio Martella
>>>
>>> --
>>> Claudio Martella
>
> --
> Claudio Martella
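[Editor's note] The thread turns on whether the child heap fits in the nodes' 16GB of RAM given the configured 7 map + 7 reduce slots per task tracker. A back-of-envelope check can be sketched as follows; this is an illustration, not part of the thread, `worst_case_child_heap_gb` is a hypothetical helper, and the slot counts and heap values are taken from the config and messages above:

```python
# Worst-case per-node memory demand if every map and reduce slot
# runs a child JVM at its full -Xmx heap at the same time.

def worst_case_child_heap_gb(map_slots, reduce_slots, xmx_mb):
    """Total child-JVM heap in GB if all slots are busy simultaneously."""
    return (map_slots + reduce_slots) * xmx_mb / 1024

node_ram_gb = 16  # per node, from the thread

# Heap values that appear in the thread: the overriding -mx1024m,
# the attached file's -Xmx3600m, and the quoted -Xmx6144m.
for xmx_mb in (1024, 3600, 6144):
    need = worst_case_child_heap_gb(7, 7, xmx_mb)
    verdict = "fits" if need <= node_ram_gb else "oversubscribed"
    print(f"-Xmx{xmx_mb}m -> {need:.1f} GB worst case ({verdict} in {node_ram_gb} GB)")
```

With 7+7 slots, `-Xmx6144m` can demand up to 84GB on a 16GB node, so the question "Is 6144MB enough?" is really a question of oversubscription. Note also that the quoted value contains both `-Xmx6144m` and `-mx1024m`; in HotSpot, `-mx` is an old alias for `-Xmx`, and the last value on the command line typically wins, so the effective heap may be far smaller than intended.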