Hi, I'm a Hadoop newbie. My question is as follows:
The level of parallelism of a job, with respect to mappers, is essentially the number of map tasks spawned, which equals the number of InputSplits. But within each InputSplit there may be many records (many input key-value pairs), each of which is processed by a separate call to the map() method. Are these calls within a single map task also executed in parallel by the framework?

I'm about to write a small map/reduce program that handles a small input data file (a few tens of MB). The data file is a text file containing many variable-length string sequences delimited by the star (*) character. The program will extract the string sequences from the file and handle each one individually, in parallel. There seem to be two ways to handle the input file:

(1) Don't let the framework split it. In other words, there would be only one InputSplit (the entire file) and one map task. The RecordReader (?) would be responsible for extracting the individual string sequences and feeding them to the map() calls. But if the answer to my first question is no (the calls are not executed in parallel), there is no point in writing this program as a map/reduce one.

(2) Don't feed the data file to the job as an input file; instead, put it in the DistributedCache. Spawn as many map tasks as needed (optimally, probably as many as there are sequences). These tasks would have no input file, but would take their input data from the cached file (there must be some mechanism to make sure each task handles a different sequence).

I'd highly appreciate it if anyone could answer my question and/or comment on these two ways of handling the input file, or suggest other ways of doing it.

Thanks,
David.
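To make option (1) concrete, here is a rough sketch of the extraction logic such a custom RecordReader would perform, written as plain Java with no Hadoop dependencies. The class and method names are illustrative only, not the Hadoop API; in a real job this logic would live inside a RecordReader's nextKeyValue().

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Hadoop API): read a stream and emit each
// '*'-delimited sequence as one record, the way a custom RecordReader
// for option (1) would feed records to successive map() calls.
public class StarDelimitedReader {
    private final InputStream in;
    private final StringBuilder buf = new StringBuilder();
    private boolean eof = false;

    public StarDelimitedReader(InputStream in) { this.in = in; }

    // Returns the next sequence, or null when the stream is exhausted.
    // Each call yields exactly one record, i.e. one map() invocation.
    public String next() throws IOException {
        if (eof && buf.length() == 0) return null;
        while (!eof) {
            int c = in.read();
            if (c == -1) { eof = true; break; }
            if (c == '*') {                 // delimiter ends the record
                String rec = buf.toString();
                buf.setLength(0);
                return rec;
            }
            buf.append((char) c);
        }
        if (buf.length() == 0) return null; // nothing after last '*'
        String rec = buf.toString();
        buf.setLength(0);
        return rec;
    }

    public static void main(String[] args) throws IOException {
        InputStream data = new ByteArrayInputStream(
                "alpha*bravo*charlie".getBytes(StandardCharsets.US_ASCII));
        StarDelimitedReader reader = new StarDelimitedReader(data);
        List<String> records = new ArrayList<>();
        for (String rec; (rec = reader.next()) != null; ) {
            records.add(rec);               // one record -> one map() call
        }
        System.out.println(records);        // [alpha, bravo, charlie]
    }
}
```

Note that nothing here is parallel: the records come out one at a time, which is exactly why the answer to the first question matters for this option.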
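For option (2), one possible "each task handles a different sequence" mechanism, again as a plain-Java sketch with purely illustrative names: every task reads the whole cached file, but task i only keeps the sequences whose position modulo the number of tasks equals i.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Hadoop API): deterministic round-robin
// assignment of '*'-delimited sequences to tasks, so that the tasks
// in option (2) partition the cached file without coordinating.
public class SequencePartitioner {
    // Return the sequences that task `taskId` of `numTasks` should handle.
    public static List<String> sequencesForTask(String cachedFile,
                                                int taskId, int numTasks) {
        String[] all = cachedFile.split("\\*");
        List<String> mine = new ArrayList<>();
        for (int i = 0; i < all.length; i++) {
            if (i % numTasks == taskId) {   // round-robin by position
                mine.add(all[i]);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        String cached = "s0*s1*s2*s3*s4";
        System.out.println(sequencesForTask(cached, 0, 2)); // [s0, s2, s4]
        System.out.println(sequencesForTask(cached, 1, 2)); // [s1, s3]
    }
}
```

Each task would need to know its own index (e.g. derived from the task ID) for this to work; whether this beats option (1) presumably depends on how expensive handling one sequence is.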