Xuan Dzung Doan wrote:
Hi,

I'm a Hadoop newbie. My question is as follows:

The level of parallelism of a job, with respect to mappers, is largely
determined by the number of map tasks spawned, which equals the number of
InputSplits. But within each InputSplit there may be many records (many input
key-value pairs), each of which is processed by a separate call to the map()
method. So are these calls within one single map task also executed in
parallel by the framework?

Afaik no. Within a single map task, the framework calls map() once per
record, sequentially, in a single thread.
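If you really want concurrency inside one task, Hadoop ships a multithreaded
mapper wrapper. A minimal sketch, assuming the newer org.apache.hadoop.mapreduce
API (MyMapper is a placeholder for your own mapper, and it must be thread-safe
since its map() is called from several threads at once):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

    Configuration conf = new Configuration();
    Job job = new Job(conf, "my-job");
    // The job still runs one map task per split, but inside each task the
    // wrapper fans the map() calls out to a pool of threads.
    job.setMapperClass(MultithreadedMapper.class);
    MultithreadedMapper.setMapperClass(job, MyMapper.class); // does the real work
    MultithreadedMapper.setNumberOfThreads(job, 8);          // thread count chosen arbitrarily

This only pays off if each map() call does enough work (CPU or I/O) to
amortize the threading overhead.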
Now I'm going to write a small map/reduce program that handles a small input
data file (a few tens of MB). The data file is a text file containing many
variable-length string sequences delimited by the star (*) character. The
program will extract the string sequences from the file and handle each one
individually, in parallel.

It looks like there are two ways to handle the input file:

1) Don't have it split by the framework. In other words, there will be only
one InputSplit, which is the entire file, and one map task. A custom
RecordReader will be responsible for extracting the individual string
sequences and feeding them to the map() method calls. But if the answer to my
previous question is no (the calls are not executed in parallel), it's
pointless to write this program as a map/reduce one. (See the first sketch
after these two options.)

(2) Don't feed the data file to the job as an input file, but instead cache
it in the DistributedCache. Spawn as many map tasks as needed (optimally,
probably as many as there are sequences). These tasks will have no input
file, but will instead take their input from the cached copy (there must be
some mechanism to make sure each task handles a different sequence). (See the
second sketch below.)
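For (1), something like this rough sketch is what I have in mind. I believe
newer Hadoop versions let you override TextInputFormat's record delimiter, so
the only custom code is forbidding splits (class and job names are mine):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    // One InputSplit for the whole file, hence one map task.
    public class NonSplittableTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false;
        }
    }

    // Driver: treat '*' as the record separator, so each map() call
    // receives exactly one string sequence as its value.
    Configuration conf = new Configuration();
    conf.set("textinputformat.record.delimiter", "*");
    Job job = new Job(conf, "star-sequences");
    job.setInputFormatClass(NonSplittableTextInputFormat.class);

For (2), I imagine the driver caches the file and each mapper picks out its
own share of the sequences, roughly like this (old org.apache.hadoop.mapred
API; the modulo scheme is just one possible mechanism):

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    // Driver: ship the data file to every node. The job still needs some
    // dummy input, since map tasks are only spawned for input splits.
    JobConf conf = new JobConf();
    DistributedCache.addCacheFile(new URI("/data/sequences.txt"), conf);
    conf.setNumMapTasks(numSequences); // a hint, not a guarantee

    // Mapper, in configure(): open the local copy and keep only the
    // sequences whose index modulo the number of tasks equals this
    // task's partition number.
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
    int taskId = conf.getInt("mapred.task.partition", 0);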
I would prefer writing my own InputFormat to do the splitting, and giving the
input file a very high replication factor. That way you can easily increase
the file size without any additional changes.
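A minimal sketch of that InputFormat (new API; all names are mine).
getSplits() emits one split per '*'-delimited sequence, so every sequence
becomes its own map task, and the record reader hands the whole split back as
a single value:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class StarSequenceInputFormat
            extends FileInputFormat<NullWritable, Text> {

        // Scan the file once and emit one FileSplit per sequence. A byte-wise
        // scan is fine here because the whole file is only a few tens of MB.
        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (FileStatus status : listStatus(job)) {
                Path path = status.getPath();
                FileSystem fs = path.getFileSystem(job.getConfiguration());
                FSDataInputStream in = fs.open(path);
                long start = 0, pos = 0;
                int b;
                while ((b = in.read()) != -1) {
                    if (b == '*') {
                        if (pos > start)
                            splits.add(new FileSplit(path, start, pos - start,
                                                     new String[0]));
                        start = pos + 1; // skip the delimiter itself
                    }
                    pos++;
                }
                if (pos > start) // trailing sequence without a final '*'
                    splits.add(new FileSplit(path, start, pos - start,
                                             new String[0]));
                in.close();
            }
            return splits;
        }

        // Each split is one sequence, so the reader yields exactly one record.
        @Override
        public RecordReader<NullWritable, Text> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new RecordReader<NullWritable, Text>() {
                private FSDataInputStream in;
                private FileSplit fileSplit;
                private Text value = new Text();
                private boolean consumed = false;

                public void initialize(InputSplit s, TaskAttemptContext ctx)
                        throws IOException {
                    fileSplit = (FileSplit) s;
                    FileSystem fs = fileSplit.getPath()
                            .getFileSystem(ctx.getConfiguration());
                    in = fs.open(fileSplit.getPath());
                    in.seek(fileSplit.getStart());
                }

                public boolean nextKeyValue() throws IOException {
                    if (consumed) return false;
                    byte[] buf = new byte[(int) fileSplit.getLength()];
                    in.readFully(buf); // sequences are small enough to buffer
                    value.set(buf);
                    consumed = true;
                    return true;
                }

                public NullWritable getCurrentKey() { return NullWritable.get(); }
                public Text getCurrentValue() { return value; }
                public float getProgress() { return consumed ? 1.0f : 0.0f; }
                public void close() throws IOException { in.close(); }
            };
        }
    }

Then bump the replication factor so most map slots can read a local copy of
the (small) file:

    hadoop fs -setrep -w 10 /data/sequences.txt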
I'd highly appreciate it if anyone could answer my question and/or comment on
these two ways of handling the input file, or suggest other ways of doing it.

Thanks,
David.



