In my $HADOOP_HOME/conf/hdfs-site.xml, I have set the data block size:

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <final>true</final>
</property>

While running teragen, I specify it again to be sure:

hadoop jar /opt/hadoop-1.0.4/hadoop-examples-1.0.4.jar teragen
-Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 -Ddfs.block.size=134217728
320000 /user/hadoop/input

but it generates 3 blocks:

hadoop fsck -blocks -files -locations /user/hadoop/input
Status: HEALTHY
 Total size:    32029543 B
 Total dirs:    3
 Total files:    4
 Total blocks (validated):    3 (avg. block size 10676514 B)
 Minimally replicated blocks:    3 (100.0 %)

What am I doing wrong? How can I generate only one block?
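For reference, here is the arithmetic I would expect, assuming teragen's 100-byte records (this is only a back-of-the-envelope check, not Hadoop's actual split logic):

```python
import math

BLOCK_SIZE = 134217728  # dfs.block.size = 128 MB
ROW_BYTES = 100         # teragen writes 100-byte records

# The 320000-row run: 32 MB of data should fit in a single block,
# so any extra blocks fsck reports would have to come from the other
# files under the path (fsck counts every file in the directory).
small_run = 320000 * ROW_BYTES
print(math.ceil(small_run / BLOCK_SIZE))

# The 32000000-row run from the earlier message: 3.2 GB of data needs
# 24 blocks at 128 MB, which matches the 24 map tasks terasort launched
# (one input split per block).
big_run = 32000000 * ROW_BYTES
print(math.ceil(big_run / BLOCK_SIZE))
```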



On Tue, Feb 26, 2013 at 12:52 PM, Arindam Choudhury <
arindamchoudhu...@gmail.com> wrote:

> Thanks. As Julien said, I want to do a performance measurement.
>
> Actually,
>
> hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1
> -Dmapred.reduce.tasks=1 32000000 /user/hadoop/input32mb1map
>
> has generated:
> Total size:    3200029737 B
> Total dirs:    3
> Total files:    5
> Total blocks (validated):    27 (avg. block size 118519619 B)
>
> That's why there are so many maps.
>
>
> On Tue, Feb 26, 2013 at 12:46 PM, Julien Muller 
> <julien.mul...@ezako.com>wrote:
>
>> Maybe your goal is to have a baseline for performance measurement?
>> In that case, you might want to consider running only one TaskTracker:
>> you would still have multiple tasks, but they would all run on one machine.
>> Also, you could make the mappers run serially by configuring only one map
>> slot on your one-node cluster.
>>
>> Nevertheless, I agree with Bertrand: this is not really a realistic use
>> case (or maybe you can give us more details).
>>
>> Julien
>>
>>
>> 2013/2/26 Bertrand Dechoux <decho...@gmail.com>
>>
>>> http://wiki.apache.org/hadoop/HowManyMapsAndReduces
>>>
>>> It is possible to have a single mapper if the input is not splittable,
>>> but that is rarely seen as a feature.
>>> One could ask why you want to use a distributed computing platform
>>> for a job that shouldn't be distributed.
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>>
>>>
>>> On Tue, Feb 26, 2013 at 12:09 PM, Arindam Choudhury <
>>> arindamchoudhu...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am trying to run terasort using one map and one reduce, so I
>>>> generated the input data using:
>>>>
>>>> hadoop jar hadoop-examples-1.0.4.jar teragen -Dmapred.map.tasks=1
>>>> -Dmapred.reduce.tasks=1 32000000 /user/hadoop/input32mb1map
>>>>
>>>> Then I launched the hadoop terasort job using:
>>>>
>>>> hadoop jar hadoop-examples-1.0.4.jar terasort -Dmapred.map.tasks=1
>>>> -Dmapred.reduce.tasks=1 /user/hadoop/input32mb1map /user/hadoop/output1
>>>>
>>>> I thought it would run the job using 1 map and 1 reduce, but when I
>>>> inspected the job statistics I found:
>>>>
>>>> hadoop job -history /user/hadoop/output1
>>>>
>>>> Task Summary
>>>> ============================
>>>> Kind    Total    Successful    Failed    Killed    StartTime
>>>> FinishTime
>>>>
>>>> Setup    1    1        0    0    26-Feb-2013 10:57:47    26-Feb-2013
>>>> 10:57:55 (8sec)
>>>> Map    24    24        0    0    26-Feb-2013 10:57:57    26-Feb-2013
>>>> 11:05:37 (7mins, 40sec)
>>>> Reduce    1    1        0    0    26-Feb-2013 10:58:21    26-Feb-2013
>>>> 11:08:31 (10mins, 10sec)
>>>> Cleanup    1    1        0    0    26-Feb-2013 11:08:32    26-Feb-2013
>>>> 11:08:36 (4sec)
>>>> ============================
>>>>
>>>> So, although I asked for one map task, there are 24 of them.
>>>>
>>>> How can I solve this? How can I tell Hadoop to launch only one map?
>>>>
>>>> Thanks,
>>>>
>>>
>>>
>>
>