Hi Harsh, I applied the patch to the Hadoop source code, but can you please tell me exactly where this log gets printed? I checked the log files of the JobTracker and TaskTracker but it is not there. It is also not printed in the _logs folder created inside the output directory of the MR job.
Regards,
Nikhil

-----Original Message-----
From: Harsh J [mailto:ha...@cloudera.com]
Sent: Monday, May 13, 2013 1:28 PM
To: <user@hadoop.apache.org>
Subject: Re: How to combine input files for a MapReduce job

Yes, I believe the branch-1 patch attached there should apply cleanly to 1.0.4.

On Mon, May 13, 2013 at 1:25 PM, Agarwal, Nikhil <nikhil.agar...@netapp.com> wrote:
> Hi,
>
> @Harsh: Thanks for the reply. Would the patch work with the Hadoop 1.0.4 release?
>
> -----Original Message-----
> From: Harsh J [mailto:ha...@cloudera.com]
> Sent: Monday, May 13, 2013 1:03 PM
> To: <user@hadoop.apache.org>
> Subject: Re: How to combine input files for a MapReduce job
>
> For the "control number of mappers" question: You can use
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html
> which is designed to solve similar cases. However, you cannot beat the
> speed you get out of a single large file (or a few large files), as you
> will still have file open/close overheads, which will bog you down.
>
> For the "which file is being submitted to which" question: Having
> https://issues.apache.org/jira/browse/MAPREDUCE-3678 in the
> version/distribution of Apache Hadoop you use would help.
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil <nikhil.agar...@netapp.com> wrote:
>> Hi,
>>
>> I have a 3-node cluster, with the JobTracker running on one machine and
>> TaskTrackers on the other two. Instead of using HDFS, I have written my
>> own FileSystem implementation. As an experiment, I kept 1000 text files
>> (all of the same size) on both slave nodes and ran a simple WordCount
>> MR job. It took around 50 minutes to complete. Afterwards, I
>> concatenated all 1000 files into a single file and ran the WordCount MR
>> job again; it took 35 seconds. From the JobTracker UI I could make out
>> that the problem is the number of mappers the JobTracker creates: for
>> 1000 files it creates 1000 maps, and for 1 file it creates 1 map
>> (irrespective of file size).
>>
>> Thus, is there a way to reduce the number of mappers? That is, can I
>> control the number of mappers through some configuration parameter so
>> that Hadoop would club the files together until it reaches some
>> specified size (say, 64 MB) and then create 1 map per 64 MB block?
>>
>> Also, I wanted to know how to see which file is being submitted to
>> which TaskTracker, or, if that is not possible, how do I check whether
>> any data transfer happens between my slave nodes during an MR job?
>>
>> Sorry for so many questions, and thank you for your time.
>>
>> Regards,
>>
>> Nikhil
>
>
>
> --
> Harsh J

--
Harsh J
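For reference, a minimal sketch of the CombineFileInputFormat approach Harsh points to above, assuming the old org.apache.hadoop.mapred API that a 1.0.4 WordCount job would use. The class there is abstract, so a small subclass is needed; the names CombinedTextInputFormat and SingleFileLineReader are invented for illustration, and the (CombineFileSplit, Configuration, Reporter, Integer) constructor is the contract CombineFileRecordReader expects in branch-1:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.CombineFileInputFormat;
import org.apache.hadoop.mapred.lib.CombineFileRecordReader;
import org.apache.hadoop.mapred.lib.CombineFileSplit;

// Packs many small files into each split, so one mapper handles many files
// instead of the 1-file-1-mapper behavior seen in the 1000-file experiment.
public class CombinedTextInputFormat
    extends CombineFileInputFormat<LongWritable, Text> {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    // CombineFileRecordReader instantiates one SingleFileLineReader per
    // file chunk inside the combined split.
    @SuppressWarnings({ "unchecked", "rawtypes" })
    Class<RecordReader<LongWritable, Text>> readerClass =
        (Class) SingleFileLineReader.class;
    return new CombineFileRecordReader<LongWritable, Text>(
        conf, (CombineFileSplit) split, reporter, readerClass);
  }

  // Reads one file of the combined split by delegating to LineRecordReader.
  public static class SingleFileLineReader
      implements RecordReader<LongWritable, Text> {

    private final LineRecordReader delegate;

    // CombineFileRecordReader requires exactly this constructor signature;
    // idx selects which file of the combined split this reader handles.
    public SingleFileLineReader(CombineFileSplit split, Configuration conf,
                                Reporter reporter, Integer idx)
        throws IOException {
      FileSplit fileSplit = new FileSplit(split.getPath(idx),
          split.getOffset(idx), split.getLength(idx), split.getLocations());
      delegate = new LineRecordReader(conf, fileSplit);
    }

    public boolean next(LongWritable key, Text value) throws IOException {
      return delegate.next(key, value);
    }
    public LongWritable createKey() { return delegate.createKey(); }
    public Text createValue() { return delegate.createValue(); }
    public long getPos() throws IOException { return delegate.getPos(); }
    public float getProgress() throws IOException {
      return delegate.getProgress();
    }
    public void close() throws IOException { delegate.close(); }
  }
}
```

In the job driver, the combined split size can then be capped so each mapper gets roughly one block's worth of small files, matching the 64 MB figure asked about above. To the best of my recollection, branch-1's CombineFileInputFormat reads the mapred.max.split.size property for this (worth verifying against your distribution); WordCount.class stands in for the existing job class:

```java
JobConf conf = new JobConf(WordCount.class);
conf.setInputFormat(CombinedTextInputFormat.class);
// Pack small files into combined splits of at most ~64 MB each.
conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);
```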