Shashwat,

Tweaking the split sizes only affects how a single input is split,
not how the resulting splits are packed together. Those properties can
be used with CombineFileInputFormat to control the size of the packed
splits, but on their own they will not "merge" the processing of
several blocks across files into the same map task.
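
A rough sketch of the CombineFileInputFormat route with the new API is
below, assuming your Hadoop release ships CombineTextInputFormat (on
older versions you would subclass CombineFileInputFormat and supply a
RecordReader yourself); the class and job names here are just
illustrative:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedWordCount {

  // Standard word-count mapper: emits (word, 1) for every token.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Standard word-count reducer: sums the counts per word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount-combined");
    job.setJarByClass(CombinedWordCount.class);

    // Pack many small input files into splits of at most 64 MB each,
    // instead of one split (and hence one map task) per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}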

On Mon, May 13, 2013 at 1:03 PM, shashwat shriparv
<dwivedishash...@gmail.com> wrote:
> Look into mapred.max.split.size, mapred.min.split.size and the number of
> mappers in mapred-site.xml
>
> Thanks & Regards
>
> ∞
>
> Shashwat Shriparv
>
>
>
> On Mon, May 13, 2013 at 12:50 PM, Agarwal, Nikhil
> <nikhil.agar...@netapp.com> wrote:
>>
>> Hi,
>>
>>
>>
>> I have a 3-node cluster, with the JobTracker running on one machine and
>> TaskTrackers on the other two. Instead of using HDFS, I have written my own
>> FileSystem implementation. As an experiment, I kept 1000 text files (all of
>> the same size) on both slave nodes and ran a simple Wordcount MR job. It
>> took around 50 minutes to complete. I then concatenated the 1000 files into
>> a single file and ran the Wordcount job again; it took 35 seconds. From the
>> JobTracker UI I could make out that the problem is the number of mappers the
>> JobTracker creates: for 1000 files it creates 1000 maps, and for 1 file it
>> creates 1 map (irrespective of file size).
>>
>>
>>
>> Thus, is there a way to reduce the number of mappers, i.e. can I control
>> the number of mappers through some configuration parameter so that Hadoop
>> clubs the files together until it reaches some specified size (say, 64 MB)
>> and then creates one map per 64 MB block?
>>
>>
>>
>> Also, I wanted to know how to see which file is being submitted to which
>> TaskTracker, or, if that is not possible, how do I check whether any data
>> transfer is happening between my slave nodes during an MR job?
>>
>>
>>
>> Sorry for so many questions, and thank you for your time.
>>
>>
>>
>> Regards,
>>
>> Nikhil
>
>
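
Regarding the mapred.max.split.size / mapred.min.split.size properties
quoted above: here is a minimal sketch of setting them per job instead
of cluster-wide in mapred-site.xml (these are the Hadoop 1.x property
names from the quoted mail, and on their own they only bound how a
single file is split; they do not pack files together):

import org.apache.hadoop.conf.Configuration;

public class SplitSizeConfig {
  public static Configuration withSplitBounds() {
    Configuration conf = new Configuration();
    // Hadoop 1.x property names, as in the quoted mail; Hadoop 2.x renames
    // them to mapreduce.input.fileinputformat.split.maxsize / .minsize.
    conf.setLong("mapred.max.split.size", 64L * 1024 * 1024); // 64 MB cap per split
    conf.setLong("mapred.min.split.size", 16L * 1024 * 1024); // 16 MB floor per split
    return conf;
  }
}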



-- 
Harsh J
