Thank you Allen.
So, is it fair to assume that with a smaller block size (64 MB), my blocks are 
distributed across more datanodes, and because the blocks are spread around 
more datanodes, my map tasks should also run on different datanodes, and 
because each map's input is smaller, each map should execute faster and use 
fewer resources? 
Is that how it actually works, or is there an algorithm that decides how the 
blocks are distributed across the datanodes and where the replica copies 
should go?
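
For concreteness, this is a rough sketch of how I am setting the smaller block 
size in my tests (the path is only a placeholder, and I am assuming the 
dfs.block.size property name and the FileSystem.create overload that takes an 
explicit block size; those may differ between versions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithSmallBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default block size for files written through this client (64 MB).
        conf.setLong("dfs.block.size", 64L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // Alternatively, override the block size for just this one file.
        Path out = new Path("/user/syed/test-64mb-blocks.dat");  // placeholder path
        FSDataOutputStream stream = fs.create(
                out,
                true,                                      // overwrite
                conf.getInt("io.file.buffer.size", 4096),  // buffer size
                (short) 3,                                 // replication factor
                64L * 1024 * 1024);                        // per-file block size
        // ... write the test data here ...
        stream.close();
    }
}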

Let's say I have a 640 MB file, a cluster with 5 datanodes, and the block size 
configured to 64 MB. How will the blocks be distributed? 
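
Just to check my arithmetic: 640 MB at a 64 MB block size should come out to 
10 blocks, and assuming the default replication factor of 3, that is 30 block 
replicas spread over the 5 datanodes. This is the sketch I was going to use to 
look at the actual placement (again, the path is only a placeholder), using 
the standard FileSystem block-location call:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.util.Arrays;

public class ShowBlockLayout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/syed/test-64mb-blocks.dat");  // placeholder path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        // For a 640 MB file with 64 MB blocks I would expect 10 entries here,
        // each listing the datanodes that hold that block's replicas.
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    Arrays.toString(block.getHosts()));
        }
    }
}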

Regards
Syed Wasti

> From: awittena...@linkedin.com
> To: general@hadoop.apache.org
> Subject: Re: Data Block Size ?
> Date: Thu, 15 Jul 2010 18:49:04 +0000
> 
> 
> On Jul 15, 2010, at 11:40 AM, Syed Wasti wrote:
> 
> > Will it matter what the data block size is? 
> 
> Yes.
> 
> > It is recommended to have a block size of 64 MB, but if we want to set the 
> > data block size to 128 MB, should this affect the performance?
> 
> Yes.
> 
> FWIW, we run with 128MB.
> 
> > Does the size of the map jobs created on each datanode in any way depend on 
> > the block size?
> 
> Yes.
> 
> Unless told otherwise, Hadoop will generally use the # of maps == # of 
> blocks.  So if you have fewer blocks to process, you'll have fewer maps to do 
> more work.  This is not necessarily a bad thing; it all depends upon your 
> workload, size of grid, etc.
> 