Hi all,

I have some queries about a map task's awareness of its input. From what I understand, every map task instance is destined to process the data in a specific input split (which can span HDFS blocks).
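To see what a task itself knows about this, I have been printing the task number and the split it was handed from inside the mapper, roughly like the snippet below (this assumes the new org.apache.hadoop.mapreduce API; the class name SplitAwareMapper is just mine for illustration - please correct me if I am reading the wrong things):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Sketch: log which map task instance this is and which byte range it was given.
    public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

      @Override
      protected void setup(Context context) throws IOException, InterruptedException {
        // Task number of this map within the job (e.g. 3 for attempt ..._m_000003_0).
        int taskNumber = context.getTaskAttemptID().getTaskID().getId();

        InputSplit split = context.getInputSplit();
        if (split instanceof FileSplit) {
          FileSplit fileSplit = (FileSplit) split;
          // From the task's point of view the split is just (file, start offset, length).
          System.out.printf("map %d -> %s [offset=%d, length=%d]%n",
              taskNumber, fileSplit.getPath(), fileSplit.getStart(), fileSplit.getLength());
        }
      }

      // map(...) is unchanged from whatever the job normally does.
    }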
1) Does each map task have a unique instance number? If so, how is it mapped to its specific input split, i.e. which parameters define the mapping (for example, map task number to input-file byte offset)? And where exactly is this mapping preserved - at the JobTracker, the TaskTracker, or within each task?

2) Coming to a practical scenario: when I run Hadoop in local mode with a MapReduce job of 10 maps, and the node can afford to run, say, 2 map-task JVMs simultaneously, I assume some map tasks run concurrently. Since HDFS does not play a role in this case, how is the map-task-instance-to-input-split mapping carried out? Is there a concept of an input split here at all, or will all the maps start scanning from the start of the input file? (My rough picture of how splits could be computed without HDFS is in the P.S. below.)

Please help me with these queries.

Thanks,
Matthew John
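P.S. For question 2, this is roughly how I picture the splits being computed when there is no HDFS - just my own sketch of FileInputFormat-style splitting over a local file, with a hard-coded splitSize standing in for max(minSplitSize, min(maxSplitSize, blockSize)), not the actual Hadoop code:

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;

    // Sketch of how FileInputFormat-style splitting carves a file into byte ranges.
    // No HDFS involved: only the file length and a target split size are needed,
    // so even local mode could hand each map task its own (offset, length) slice.
    public class LocalSplitSketch {

      static class Split {
        final long start;
        final long length;
        Split(long start, long length) { this.start = start; this.length = length; }
      }

      static List<Split> computeSplits(long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<Split>();
        long offset = 0;
        while (offset < fileLength) {
          long length = Math.min(splitSize, fileLength - offset);
          splits.add(new Split(offset, length));
          offset += length;
        }
        return splits;
      }

      public static void main(String[] args) {
        long fileLength = new File(args[0]).length();
        long splitSize = 64L * 1024 * 1024;  // pretend 64 MB "block" size
        List<Split> splits = computeSplits(fileLength, splitSize);
        for (int i = 0; i < splits.size(); i++) {
          System.out.printf("split %d: offset=%d length=%d%n",
              i, splits.get(i).start, splits.get(i).length);
        }
      }
    }

If that picture is wrong and the maps really do all start scanning from the beginning of the file in local mode, I would love to know what actually happens.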