Xuan Dzung Doan wrote:
Hi,
I'm a Hadoop newbie. My question is as follows:
The level of parallelism of a job, with respect to mappers, is largely the
number of map tasks spawned, which equals the number of InputSplits. But
within each InputSplit there may be many records (many input key-value pairs),
each of which is processed by a separate call to the map() method. Are these
calls within one single map task also executed in parallel by the framework?
Afaik no.
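The runner that drives a map task is just a simple loop over the records of that
task's split. Stripped down to its essentials, what org.apache.hadoop.mapred.MapRunner
does is the following (this is a trimmed sketch mirroring the real run() method, not
a verbatim copy):

import java.io.IOException;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class MapRunnerSketch<K1, V1, K2, V2> {
    private Mapper<K1, V1, K2, V2> mapper;

    // Essentially what MapRunner.run() does for one map task: a plain while
    // loop, so map() is invoked once per record, one after another, in a
    // single thread.
    public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                    Reporter reporter) throws IOException {
        K1 key = input.createKey();
        V1 value = input.createValue();
        while (input.next(key, value)) {
            mapper.map(key, value, output, reporter);
        }
    }
}

So any parallelism has to come from running multiple map tasks (one per InputSplit),
not from the map() calls inside a single task.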
Now I'm going to write a small map/reduce program that handles a small input
data file (a few tens of MB). The data file is a text file containing many
variable-length string sequences delimited by the star (*) character. The
program will extract the string sequences from the file and handle each of
them individually, in parallel.
It looks like there are two ways to handle the input file:
1) Not have it split by the framework. In other words, there will be only one
InputSplit, covering the entire file, and one map task. The RecordReader (?) will
be responsible for extracting the individual string sequences and feeding them to
the map() calls. But if the answer to my previous question is no (the calls are
not processed in parallel), it's pointless to write this program as a map/reduce
one.
2) Not feed the data file as an input file to the job, but instead cache it in
the DistributedCache. Spawn as many map tasks as needed (optimally, probably as
many as the number of sequences). These tasks will not have an input file, but
will instead take their input data from the file in the cache (there must be some
mechanism to make sure each task handles a different sequence).
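If you go the DistributedCache route, the mechanism would look roughly like the
sketch below (old org.apache.hadoop.mapred API; the class name and file layout are
made up). The driver would call DistributedCache.addCacheFile(...) with the HDFS
path of the data file and run the job over a small dummy input just to get the
desired number of map tasks started; each task then picks its own subset of
sequences, e.g. round-robin by task id:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper: reads the whole cached file once, splits it on '*',
// and handles only the sequences assigned to this task.
public class CachedSequenceMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String[] sequences;
    private int taskId;
    private int numTasks;
    private boolean done = false;

    @Override
    public void configure(JobConf job) {
        taskId = job.getInt("mapred.task.partition", 0);   // this task's id within the job
        numTasks = job.getNumMapTasks();
        try {
            Path local = DistributedCache.getLocalCacheFiles(job)[0];
            BufferedReader in = new BufferedReader(new FileReader(local.toString()));
            StringBuilder sb = new StringBuilder();
            for (int c; (c = in.read()) != -1; ) sb.append((char) c);
            in.close();
            sequences = sb.toString().split("\\*");
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
        if (done) return;                 // the dummy input records themselves are ignored
        done = true;
        // Round-robin assignment so each task handles a different subset.
        for (int i = taskId; i < sequences.length; i += numTasks) {
            out.collect(new Text("seq-" + i), new Text(sequences[i]));   // "handle" sequence i
        }
    }
}

It works, but you have to hand-roll the assignment of sequences to tasks yourself.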
I would prefer writing my own InputFormat to do the splitting, and setting a
very high replication factor on the input file. That way you can easily
increase the file size later without any additional changes.
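Here is a very rough sketch of that idea (old org.apache.hadoop.mapred API; the
class name is made up and error handling and corner cases are omitted): getSplits()
scans the small file once and creates one FileSplit per '*'-delimited sequence, and
the RecordReader hands each split back to map() as a single record. Since the file
is only a few tens of MB, the extra scan is cheap, and with a high replication
factor most of those map tasks can still read their sequence from a local copy.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical InputFormat: one InputSplit (and hence one map task) per
// '*'-delimited sequence in a single small input file.
public class StarDelimitedInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        Path file = FileInputFormat.getInputPaths(job)[0];   // assume a single input file
        FileSystem fs = file.getFileSystem(job);
        List<InputSplit> splits = new ArrayList<InputSplit>();
        FSDataInputStream in = fs.open(file);
        long start = 0, pos = 0;
        int c;
        while ((c = in.read()) != -1) {
            pos++;
            if (c == '*') {                                   // end of one sequence
                if (pos - 1 > start) {
                    splits.add(new FileSplit(file, start, pos - 1 - start, (String[]) null));
                }
                start = pos;                                  // next sequence starts after the '*'
            }
        }
        if (pos > start) {                                    // trailing sequence without a final '*'
            splits.add(new FileSplit(file, start, pos - start, (String[]) null));
        }
        in.close();
        return splits.toArray(new InputSplit[splits.size()]);
    }

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, final JobConf job, Reporter reporter) throws IOException {
        final FileSplit fileSplit = (FileSplit) split;
        // One record per split: key = byte offset of the sequence, value = the sequence itself.
        return new RecordReader<LongWritable, Text>() {
            private boolean done = false;
            public boolean next(LongWritable key, Text value) throws IOException {
                if (done) return false;
                FileSystem fs = fileSplit.getPath().getFileSystem(job);
                FSDataInputStream in = fs.open(fileSplit.getPath());
                byte[] buf = new byte[(int) fileSplit.getLength()];
                in.readFully(fileSplit.getStart(), buf);
                in.close();
                key.set(fileSplit.getStart());
                value.set(new String(buf));
                done = true;
                return true;
            }
            public LongWritable createKey() { return new LongWritable(); }
            public Text createValue() { return new Text(); }
            public long getPos() { return done ? fileSplit.getLength() : 0; }
            public void close() {}
            public float getProgress() { return done ? 1.0f : 0.0f; }
        };
    }
}

Each map() call then receives one whole sequence as its value, and the framework
runs as many of those map tasks in parallel as your cluster has slots.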
I'd highly appreciate it if anyone could answer my question, comment on these
two ways of handling the input file, or suggest other ways of doing it.
Thanks,
David.