Hi Bejoy,

The total number of maps in the RandomTextWriter run was 100, so the total number of input files for WordCount is 100. My dfs.block.size is 128 MB, and I have not changed mapred.max.split.size (I could not find it in my Job.xml file). Hence, referring to the formula *max(minsplitsize, min(maxsplitsize, blocksize))*, I am assuming the effective split size to be 128 MB. Calculating blocks per file [bytes per file / block size (128 MB)] gives me 8.21 for every file, and summing these values gives 821.22 (the same as my previous calculation).
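The split arithmetic can be sketched as below. This is a simplified model of what Hadoop's FileInputFormat does (it ignores the real implementation's 1.1 "split slop" factor, so the exact count can differ slightly); the file sizes and function names are illustrative, not taken from the job:

```python
import math

MB = 1024 * 1024

def split_size(min_split, max_split, block_size):
    # FileInputFormat's formula: max(minSplitSize, min(maxSplitSize, blockSize))
    return max(min_split, min(max_split, block_size))

def num_splits(file_sizes, min_split, max_split, block_size):
    size = split_size(min_split, max_split, block_size)
    # Each file is split independently: at least one split per non-empty
    # file, and a partial trailing block still gets its own split.
    return sum(max(1, math.ceil(s / size)) for s in file_sizes)

# Illustrative numbers from this thread: 100 files of ~8.21 blocks each.
files = [int(8.21 * 128 * MB)] * 100
print(num_splits(files, 1, 128 * MB, 128 * MB))  # 900 (9 per file), not 822
```

This is Bejoy's point: because the partial 0.21-block tail of each file becomes its own split, rounding up per file (9 x 100 = 900) gives a different mapper count than summing the fractional block counts over the whole input (821.22).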
I have somehow managed to make a neat copy of the Job.xml in a Word doc. I copied it from the browser, as I cannot recover it from HDFS. Please find it in the attachment; you can refer to the parameters and configuration there. I have also attached the console output showing the bytes per file in the WordCount input.

Regards,
Gaurav Dasgupta

On Fri, Aug 17, 2012 at 3:28 PM, Bejoy Ks <bejoy.had...@gmail.com> wrote:
> Hi Gaurav
>
> To add more clarity to my previous mail:
> If you are using the default TextInputFormat, there will be *at least* one
> task generated per file, even if the file size is less than
> the block size (assuming you have split size equal to block size).
>
> So the right way to calculate the number of splits is per file, not on
> the whole input data size. Calculate the number of blocks per file; summing
> up those values across all files would give the number of mappers.
>
> What is the value of mapred.max.splitsize in your job? If it is less than
> the HDFS block size, there will be more splits even within a single HDFS block.
>
> Regards
> Bejoy KS
Job.docx
Description: application/vnd.openxmlformats-officedocument.wordprocessingml.document
Console.docx
Description: application/vnd.openxmlformats-officedocument.wordprocessingml.document