The log shows that there are 2 map tasks and 10 reduce tasks. How can there be 10 reduce tasks when I set the parameter '-Dmapred.tasktracker.reduce.tasks.maximum=7'? I would also like to increase the number of concurrent map tasks. Any parameter suggestions for that?
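To compare these two counters across several runs, they can be pulled out of the JobClient output with a short script. The following is only a minimal sketch in Python: the function name `parse_counters` and the regular expression are assumptions based on the log layout quoted in this message, not part of Hadoop or Mahout.

```python
import re

# Sample lines copied from the JobClient output quoted in this message.
LOG = """\
12/12/03 10:30:23 INFO mapred.JobClient: Counters: 32
12/12/03 10:30:23 INFO mapred.JobClient: Launched map tasks=2
12/12/03 10:30:23 INFO mapred.JobClient: Launched reduce tasks=10
12/12/03 10:30:23 INFO mapred.JobClient: Data-local map tasks=2
"""

# Assumed layout: "<timestamp> INFO mapred.JobClient: <counter name>=<integer>"
COUNTER_RE = re.compile(r"INFO mapred\.JobClient:\s+(?P<name>.+?)=(?P<value>\d+)\s*$")


def parse_counters(text):
    """Return a dict mapping counter names to integer values."""
    counters = {}
    for line in text.splitlines():
        match = COUNTER_RE.search(line)
        if match:  # lines without "name=value" (e.g. section headers) are skipped
            counters[match.group("name")] = int(match.group("value"))
    return counters


counters = parse_counters(LOG)
print("map tasks:", counters["Launched map tasks"])
print("reduce tasks:", counters["Launched reduce tasks"])
```

Section headers such as "Job Counters" carry no "=" and are ignored, so only the numeric counters end up in the dictionary.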
It seems that the configuration parameter 'mapred.tasktracker.map.tasks.maximum' doesn't increase the number of concurrently running map tasks...

Some log rows from mahout cvb:

12/12/03 10:30:23 INFO mapred.JobClient: Job complete: job_201212011004_0432
12/12/03 10:30:23 INFO mapred.JobClient: Counters: 32
12/12/03 10:30:23 INFO mapred.JobClient: File System Counters
12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of bytes read=8076460
12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of bytes written=18396152
12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of read operations=0
12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of large read operations=0
12/12/03 10:30:23 INFO mapred.JobClient: FILE: Number of write operations=0
12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of bytes read=14054985
12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of bytes written=4040120
12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of read operations=166
12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of large read operations=0
12/12/03 10:30:23 INFO mapred.JobClient: HDFS: Number of write operations=91
12/12/03 10:30:23 INFO mapred.JobClient: Job Counters
12/12/03 10:30:23 INFO mapred.JobClient: Launched map tasks=2
12/12/03 10:30:23 INFO mapred.JobClient: Launched reduce tasks=10
12/12/03 10:30:23 INFO mapred.JobClient: Data-local map tasks=2
12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=456617
12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=108715
12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/12/03 10:30:23 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/12/03 10:30:23 INFO mapred.JobClient: Map-Reduce Framework
12/12/03 10:30:23 INFO mapred.JobClient: Map input records=77332
12/12/03 10:30:23 INFO mapred.JobClient: Map output records=100
12/12/03 10:30:23 INFO mapred.JobClient: Map output bytes=8075900
12/12/03 10:30:23 INFO mapred.JobClient: Input split bytes=288
12/12/03 10:30:23 INFO mapred.JobClient: Combine input records=100
12/12/03 10:30:23 INFO mapred.JobClient: Combine output records=100
12/12/03 10:30:23 INFO mapred.JobClient: Reduce input groups=50
12/12/03 10:30:23 INFO mapred.JobClient: Reduce shuffle bytes=8076520
12/12/03 10:30:23 INFO mapred.JobClient: Reduce input records=100
12/12/03 10:30:23 INFO mapred.JobClient: Reduce output records=50
12/12/03 10:30:23 INFO mapred.JobClient: Spilled Records=200
12/12/03 10:30:23 INFO mapred.JobClient: CPU time spent (ms)=570850
12/12/03 10:30:23 INFO mapred.JobClient: Physical memory (bytes) snapshot=3334303744
12/12/03 10:30:23 INFO mapred.JobClient: Virtual memory (bytes) snapshot=35329503232
12/12/03 10:30:23 INFO mapred.JobClient: Total committed heap usage (bytes)=6070009856

Cheers, Markus

2012/12/3 Markus Paaso <markus.pa...@sagire.fi>

> Hi,
>
> I am having trouble utilizing all available CPU power for the 'mahout cvb'
> command.
> The CPU usage is just about 35% and IO wait ~0%.
> I have 8 cores and 28 GB memory in a single computer that is running
> Mahout 0.7-cdh-4.1.2 with Hadoop 2.0.0-cdh4.1.2 in pseudo-distributed mode.
> How can I take advantage of all the CPU power for a single 'mahout cvb'
> task?
>
>
> I use the following parameters to run mahout cvb:
>
> mahout cvb
>   -Ddfs.namenode.handler.count=32
>   -Dmapred.job.tracker.handler.count=32
>   -Dio.sort.factor=30
>   -Dio.sort.mb=500
>   -Dio.file.buffer.size=65536
>   -Dmapred.child.java.opts=-Xmx2g
>   -Dmapred.map.child.java.opts=-Xmx2g
>   -Dmapred.reduce.child.java.opts=-Xmx2g
>   -Dmapred.job.reuse.jvm.num.tasks=-1
>   -Dmapred.map.tasks=7
>   -Dmapred.reduce.tasks=7
>   -Dmapred.max.split.size=3145728
>   -Dmapred.min.split.size=3145728
>   -Dmapred.tasktracker.map.tasks.maximum=7
>   -Dmapred.tasktracker.reduce.tasks.maximum=7
>   -Dmapred.tasktracker.tasks.maximum=7
>   --input ~/mahout-files/mydatavectors_int
>   --output ~/mahout-files/topics
>   --num_terms 10078
>   --num_topics 50
>   --doc_topic_output ~/mahout-files/doc-topics
>   --maxIter 50
>   --num_update_threads 8
>   --num_train_threads 8
>   -block 1
>   --test_set_fraction 0.1
>   --convergenceDelta 0.0000001
>   --tempDir ~/mahout-files/cvb-temp
>
>
> The Linux top command says:
>
> Cpu(s): 33.9%us, 1.1%sy, 0.0%ni, 65.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 28479224k total, 16398624k used, 12080600k free, 899576k buffers
> Swap: 28942332k total, 0k used, 28942332k free, 5733368k cached
>
>   PID USER    PR  NI  VIRT   RES  SHR  S  %CPU  %MEM  TIME+    COMMAND
> 19765 mapred  20   0  2811m  650m 16m  S  129   2.3   3:59.06  java
> 19721 mapred  20   0  2812m  650m 16m  S  125   2.3   3:53.70  java
>
> So just 2.5 / 8 cores are fully in use.
>
>
> Regards, Markus
>

--
Markus Paaso
Developer, Sagire Software Oy
http://sagire.fi/
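
The "2.5 / 8 cores" estimate in the quoted message can be reproduced from the two %CPU values in the top output. This is only a small arithmetic sketch in Python; it assumes top is in its default mode, where 100% corresponds to one fully used core, and the variable names are my own.

```python
# %CPU of the two java processes, copied from the top output quoted above.
task_cpu_percent = [129, 125]
total_cores = 8  # the machine described in the message has 8 cores

# Assumption: top reports 100% per fully used core (its default mode).
busy_cores = sum(task_cpu_percent) / 100.0
utilization = busy_cores / total_cores

print(f"busy cores: {busy_cores:.2f} of {total_cores}")
print(f"share of total CPU: {utilization:.0%}")
```

The result is close to the 33.9%us figure in the Cpu(s) line, which fits with only two task JVMs doing most of the work.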