Similarly, there is NLineInputFormat, which does this automatically.  If your 
input is small it will read in the input and make a split for every N lines of 
input, so you don't have to reformat your data files.
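For example, here is a minimal sketch of a map-only driver using the new
MapReduce API (the mapper body, job name, and paths are placeholders; on
older releases that lack the new-API class, use
org.apache.hadoop.mapred.lib.NLineInputFormat and set
mapred.line.input.format.linespermap=1 instead):

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class OneMapPerLine {

    public static class MyMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {
      @Override
      protected void map(LongWritable offset, Text line, Context context)
          throws IOException, InterruptedException {
        // Put the expensive per-record computation here.
        context.write(line, NullWritable.get());
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "one-map-per-line");
      job.setJarByClass(OneMapPerLine.class);
      job.setMapperClass(MyMapper.class);
      job.setNumReduceTasks(0);  // map-only job

      // One split, and therefore one map task, per line of input.
      job.setInputFormatClass(NLineInputFormat.class);
      NLineInputFormat.setNumLinesPerSplit(job, 1);

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(NullWritable.class);

      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

With 10 input lines and one line per split you get 10 splits, and hence 10
map tasks, regardless of the HDFS block size.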

--Bobby Evans

On 1/10/12 8:09 AM, "GorGo" <gylf...@ru.is> wrote:



Hi.

I am no expert, but you could try this.

Your problem, I guess, is that the record reader reads multiple lines of
work (tasks) and hands them to each mapper, so if you only have a few tasks
(lines of work in the input file), Hadoop will not spawn multiple mappers.

You could try this: make each input record in your (currently single) input
file an independent file with only one line, and give your job the directory
with those files as input (not a single file). For me, this forced the
spawning of multiple mappers.
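For example, a rough sketch in plain Java (the file and directory names are
made up) that turns input.txt into one single-line file per record under an
inputs/ directory:

  import java.io.BufferedReader;
  import java.io.File;
  import java.io.FileReader;
  import java.io.FileWriter;
  import java.io.IOException;
  import java.io.PrintWriter;

  public class SplitIntoFiles {
    public static void main(String[] args) throws IOException {
      BufferedReader in = new BufferedReader(new FileReader("input.txt"));
      new File("inputs").mkdirs();
      String line;
      int n = 0;
      while ((line = in.readLine()) != null) {
        // One record per file; FileInputFormat creates at least one
        // split per file, so each record gets its own mapper.
        PrintWriter out =
            new PrintWriter(new FileWriter(new File("inputs", "part-" + n++)));
        out.println(line);
        out.close();
      }
      in.close();
    }
  }

Then copy inputs/ into HDFS and give that directory to the job as its input
path.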

There is another, more correct way that forces the spawning of a map task for
each line, but as I was using C++ Pipes that was not an option for me.

Hope this helps
   GorGo



sset wrote:
>
> Hello,
>
> In HDFS we have set the block size to 40 bytes. The input data set is as
> below, with each record terminated by a line feed.
>
> data1   (5*8=40 bytes)
> data2
> ......
> .......
> data10
>
>
> But still we see only 2 map tasks spawned; there should have been at least
> 10 map tasks. Each mapper performs a complex mathematical computation, and
> we are not sure how this works internally. Splitting on line feeds does not
> work. Even with the settings below, the number of map tasks never goes
> beyond 2. Is there any way to make this spawn 10 tasks? Basically it should
> act like a compute grid: computation in parallel.
>
> <property>
>   <name>io.bytes.per.checksum</name>
>   <value>30</value>
>   <description>The number of bytes per checksum.  Must not be larger than
>   io.file.buffer.size.</description>
> </property>
>
> <property>
>   <name>dfs.block.size</name>
>   <value>30</value>
>   <description>The default block size for new files.</description>
> </property>
>
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>10</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
>
> This is a single node with a high configuration: 8 CPUs and 8 GB of memory.
> Hence we are taking an example of 10 data items separated by line feeds. We
> want to utilize the full power of the machine, so we want at least 10 map
> tasks; each task needs to perform a highly complex mathematical simulation.
> At present it looks like the file data size is the only way to specify the
> number of map tasks, via the split size (in bytes), but I would prefer some
> criterion like line feeds or whatever.
>
> How do we get 10 map tasks from the above configuration? Please help.
>
> thanks
>
>


