In my $HADOOP_HOME/conf/hdfs-site.xml, I have set the data-block size:

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <final>true</final>
</property>

While running teragen I specify it again, just to be sure:

hadoop jar /opt/hadoop-1.0.4/hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 -Ddfs.block.size=134217728 320000 /user/hadoop/input

but it generates 3 blocks:

hadoop fsck -blocks -files -locations /user/hadoop/input

Status: HEALTHY
 Total size:    32029543 B
 Total dirs:    3
 Total files:   4
 Total blocks (validated):      3 (avg. block size 10676514 B)
 Minimally replicated blocks:   3 (100.0 %)

What am I doing wrong? How can I generate only one block?

On Tue, Feb 26, 2013 at 12:52 PM, Arindam Choudhury <
arindamchoudhu...@gmail.com> wrote:

> Thanks. As Julien said, I want to do a performance measurement.
>
> Actually,
>
> hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1
> -Dmapred.reduce.tasks=1 32000000 /user/hadoop/input32mb1map
>
> has generated:
>
> Total size:    3200029737 B
> Total dirs:    3
> Total files:   5
> Total blocks (validated):      27 (avg. block size 118519619 B)
>
> That's why there are so many maps.
>
>
> On Tue, Feb 26, 2013 at 12:46 PM, Julien Muller
> <julien.mul...@ezako.com> wrote:
>
>> Maybe your goal is to have a baseline for performance measurement?
>> In that case, you might want to consider running only one TaskTracker.
>> You would have multiple tasks, but running on only one machine. You
>> could also make the mappers run serially by configuring only one map
>> slot on your single-node cluster.
>>
>> Nevertheless, I agree with Bertrand: this is not really a realistic use
>> case (or maybe you can give us more clues).
>>
>> Julien
>>
>>
>> 2013/2/26 Bertrand Dechoux <decho...@gmail.com>
>>
>>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>>
>>> It is possible to have a single mapper if the input is not splittable,
>>> BUT it is rarely seen as a feature.
>>> One could ask why you want to use a platform for distributed computing
>>> for a job that shouldn't be distributed.
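As a sanity check on the numbers in this thread, the expected block count for a single file is ceil(file_size / block_size). A small sketch of that arithmetic (the helper name is mine, not from the thread), assuming the 134217728 B block size configured above:

```python
import math

def blocks_for(file_size, block_size=134217728):
    """Number of HDFS blocks a single file of file_size bytes occupies."""
    return max(1, math.ceil(file_size / block_size))

# The 32029543 B of teragen output fits in a single 128 MB block:
print(blocks_for(32029543))    # 1
# fsck reports totals over every file under the path (4 files here),
# so the extra blocks likely belong to the other files, not the data.

# For the 3.2 GB run quoted below, 128 MB splits would give:
print(blocks_for(3200029737))  # 24
```

Note that 24 is exactly the number of map tasks reported in the terasort job history later in the thread, which is consistent with the input being carved into 128 MB splits regardless of the mapred.map.tasks hint.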
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>>
>>> On Tue, Feb 26, 2013 at 12:09 PM, Arindam Choudhury <
>>> arindamchoudhu...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am trying to run terasort using one map and one reduce, so I
>>>> generated the input data using:
>>>>
>>>> hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1
>>>> -Dmapred.reduce.tasks=1 32000000 /user/hadoop/input32mb1map
>>>>
>>>> Then I launched the terasort job using:
>>>>
>>>> hadoop jar hadoop-examples-1.0.4.jar terasort -Dmapred.map.tasks=1
>>>> -Dmapred.reduce.tasks=1 /user/hadoop/input32mb1map /user/hadoop/output1
>>>>
>>>> I thought it would run the job using 1 map and 1 reduce, but when I
>>>> inspected the job statistics I found:
>>>>
>>>> hadoop job -history /user/hadoop/output1
>>>>
>>>> Task Summary
>>>> ============================
>>>> Kind     Total  Successful  Failed  Killed  StartTime             FinishTime
>>>> Setup    1      1           0       0       26-Feb-2013 10:57:47  26-Feb-2013 10:57:55 (8sec)
>>>> Map      24     24          0       0       26-Feb-2013 10:57:57  26-Feb-2013 11:05:37 (7mins, 40sec)
>>>> Reduce   1      1           0       0       26-Feb-2013 10:58:21  26-Feb-2013 11:08:31 (10mins, 10sec)
>>>> Cleanup  1      1           0       0       26-Feb-2013 11:08:32  26-Feb-2013 11:08:36 (4sec)
>>>> ============================
>>>>
>>>> So, although I asked for one map task, 24 of them were launched.
>>>>
>>>> How do I solve this? How do I tell Hadoop to launch only one map?
>>>>
>>>> Thanks,
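Julien's suggestion of serializing the mappers through a single map slot can be sketched as a mapred-site.xml fragment. The property names below are the Hadoop 1.x ones; this is a sketch of his suggestion, not a config taken from the thread, so check them against your version:

```xml
<!-- mapred-site.xml: limit the TaskTracker to one concurrent map slot,
     so map tasks run one after another even if many splits exist -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>1</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
```

This does not reduce the number of map tasks (that is driven by the input splits), but it does ensure only one runs at a time on a single-node cluster, which may be enough for a serial baseline measurement.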