For "control number of mappers" question: You can use http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html which is designed to solve similar cases. However, you cannot beat the speed you get out of a single large file (or a few large files), as you'll still have file open/close overheads which will bog you down.
For "which file is being submitted to which" question: Having https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the version/distribution of Apache Hadoop you use would help. On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <nikhil.agar...@netapp.com> wrote: > Hi, > > > > I have a 3-node cluster, with JobTracker running on one machine and > TaskTrackers on other two. Instead of using HDFS, I have written my own > FileSystem implementation. As an experiment, I kept 1000 text files (all of > same size) on both the slave nodes and ran a simple Wordcount MR job. It > took around 50 mins to complete the task. Afterwards, I concatenated all the > 1000 files into a single file and then ran a Wordcount MR job, it took 35 > secs. From the JobTracker UI I could make out that the problem is because of > the number of mappers that JobTracker is creating. For 1000 files it creates > 1000 maps and for 1 file it creates 1 map (irrespective of file size). > > > > Thus, is there a way to reduce the number of mappers i.e. can I control the > number of mappers through some configuration parameter so that Hadoop would > club all the files until it reaches some specified size (say, 64 MB) and > then make 1 map per 64 MB block? > > > > Also, I wanted to know how to see which file is being submitted to which > TaskTracker or if that is not possible then how do I check if some data > transfer is happening in between my slave nodes during a MR job? > > > > Sorry for so many questions and Thank you for your time. > > > > Regards, > > Nikhil -- Harsh J