Hi Bejoy,

The total number of maps in the RandomTextWriter execution was 100, and
hence the total number of input files for WordCount is 100.
My dfs.block.size is 128MB. I have not changed mapred.max.split.size
and could not find it in my Job.xml file.
Hence, referring to the formula *max(minsplitsize, min(maxsplitsize, blocksize))*,
I am assuming mapred.max.split.size to be 128MB.
If I calculate the blocks per file [bytes per file / block size (128 MB)],
I get 8.21 for every file. Summing these values gives 821.22 (the same as
my previous calculation).
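
As a rough cross-check of this arithmetic, here is a minimal Python sketch of
the split-size formula quoted above and of a per-file split count. It is an
illustration under stated assumptions, not Hadoop's actual code: the function
names are hypothetical, the 1.1 "slop" factor is assumed to mirror the one in
Hadoop's FileInputFormat, and every file is assumed to be about 8.21 blocks
long, as computed above.

```python
MB = 1024 * 1024

# Assumption: Hadoop's FileInputFormat lets the last split run up to 10%
# over the split size instead of creating a tiny extra split.
SPLIT_SLOP = 1.1


def compute_split_size(min_split, max_split, block_size):
    """The formula from the thread: max(minsplitsize, min(maxsplitsize, blocksize))."""
    return max(min_split, min(max_split, block_size))


def splits_for_file(file_size, split_size):
    """Count the splits for ONE file; splits never span two files."""
    splits = 0
    remaining = file_size
    # Carve off full-size splits while more than 1.1 split sizes remain.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left (if anything) becomes one final, shorter split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    block_size = 128 * MB
    # Assumption: min split size of 1 byte and an effectively unbounded max.
    split_size = compute_split_size(1, 2**63 - 1, block_size)
    # Assumption: 100 files, each about 8.21 blocks long.
    file_sizes = [int(8.21 * block_size)] * 100
    per_file = [splits_for_file(size, split_size) for size in file_sizes]
    print(per_file[0], sum(per_file))
```

Under these assumptions the fractional 8.21 blocks per file rounds up to a
whole number of splits per file, so the per-file counts, not the fractional
values, are what should be summed.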

I have somehow managed to make a neat copy of the Job.xml in a Word
document. I copied it from the browser, as I cannot recover it from HDFS.
Please find it in the attachment. You may refer to the parameters and
configuration there. I have also attached the console output showing the
bytes per file in the WordCount input.

Regards,
Gaurav Dasgupta
On Fri, Aug 17, 2012 at 3:28 PM, Bejoy Ks <bejoy.had...@gmail.com> wrote:

> Hi Gaurav
>
> To add more clarity to my previous mail:
> If you are using the default TextInputFormat, there will be *at least* one
> task generated per file, even if the file size is less than
> the block size (assuming your split size is equal to the block size).
>
> So the right way to calculate the number of splits is per file and not on
> the whole input data size. Calculating the number of blocks per file and
> summing those values over all files gives the number of mappers.
>
> What is the value of mapred.max.split.size in your job? If it is less than
> the HDFS block size, there will be more than one split even for a single
> HDFS block.
>
> Regards
> Bejoy KS
>

Attachment: Job.docx
Description: application/vnd.openxmlformats-officedocument.wordprocessingml.document

Attachment: Console.docx
Description: application/vnd.openxmlformats-officedocument.wordprocessingml.document
