Hi, I'm a Hadoop newbie. My question is as follows:
The level of parallelism of a job, with respect to mappers, is essentially the number of map tasks spawned, which equals the number of InputSplits. But within each InputSplit there may be many records (many input key-value pairs), each of which is processed by a separate call to the map() method. Are these calls within a single map task also executed in parallel by the framework?

I'm about to write a small map/reduce program that handles a small input data file (a few tens of MB). The data file is a text file containing many variable-length string sequences delimited by the star (*) character. The program will extract the string sequences from the file and handle each one individually, in parallel. There seem to be two ways to handle the input file:

(1) Don't let the framework split it. In other words, there would be only one InputSplit (the entire file) and one map task. The RecordReader (?) would be responsible for extracting the individual string sequences and feeding them to the map() calls. But if the answer to my first question is no (the calls are not executed in parallel), there is no point in writing this program as a map/reduce one.

(2) Don't feed the data file to the job as an input file; instead, put it in the DistributedCache. Spawn as many map tasks as needed (optimally, probably as many as there are sequences). These tasks would have no input file, but would take their input data from the cached file (there must be some mechanism to make sure each task handles a different sequence).

I'd highly appreciate it if anyone could answer my question and/or comment on these two ways of handling the input file, or suggest other ways of doing it.

Thanks,
David.
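To make option (1) concrete, here is a rough sketch of the extraction logic such a custom RecordReader would perform, written as plain Java with no Hadoop dependencies. The class and method names are illustrative only, not the Hadoop API; in a real job this logic would live inside a RecordReader's nextKeyValue().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Hadoop API): read a stream and emit each
// '*'-delimited sequence as one record, the way a custom RecordReader
// for option (1) would feed records to successive map() calls.
public class StarDelimitedReader {
    private final InputStream in;
    private final StringBuilder buf = new StringBuilder();
    private boolean eof = false;

    public StarDelimitedReader(InputStream in) { this.in = in; }

    // Returns the next sequence, or null when the stream is exhausted.
    // Each call yields exactly one record, i.e. one map() invocation.
    public String next() throws IOException {
        if (eof && buf.length() == 0) return null;
        while (!eof) {
            int c = in.read();
            if (c == -1) { eof = true; break; }
            if (c == '*') {                 // delimiter ends the record
                String rec = buf.toString();
                buf.setLength(0);
                return rec;
            }
            buf.append((char) c);
        }
        if (buf.length() == 0) return null; // nothing after last '*'
        String rec = buf.toString();
        buf.setLength(0);
        return rec;
    }

    public static void main(String[] args) throws IOException {
        InputStream data = new ByteArrayInputStream(
                "alpha*bravo*charlie".getBytes(StandardCharsets.US_ASCII));
        StarDelimitedReader reader = new StarDelimitedReader(data);
        List<String> records = new ArrayList<>();
        for (String rec; (rec = reader.next()) != null; ) {
            records.add(rec);               // one record -> one map() call
        }
        System.out.println(records);        // [alpha, bravo, charlie]
    }
}
```

Note that nothing here is parallel: the records come out one at a time, which is exactly why the answer to the first question matters for this option.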
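For option (2), one possible "each task handles a different sequence" mechanism, again as a plain-Java sketch with purely illustrative names: every task reads the whole cached file, but task i only keeps the sequences whose position modulo the number of tasks equals i.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Hadoop API): deterministic round-robin
// assignment of '*'-delimited sequences to tasks, so that the tasks
// in option (2) partition the cached file without coordinating.
public class SequencePartitioner {
    // Return the sequences that task `taskId` of `numTasks` should handle.
    public static List<String> sequencesForTask(String cachedFile,
                                                int taskId, int numTasks) {
        String[] all = cachedFile.split("\\*");
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < all.length; i++) {
            if (i % numTasks == taskId) {   // round-robin by position
                mine.add(all[i]);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        String cached = "s0*s1*s2*s3*s4";
        System.out.println(sequencesForTask(cached, 0, 2)); // [s0, s2, s4]
        System.out.println(sequencesForTask(cached, 1, 2)); // [s1, s3]
    }
}
```

Each task would need to know its own index (e.g. derived from the task ID) for this to work; whether this beats option (1) presumably depends on how expensive handling one sequence is.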