Yes, that's right!

Sent from my iPhone

On Oct 25, 2011, at 5:36 PM, <[email protected]> wrote:

> So I guess the JobTracker is the one reading the HDFS metadata and then
> optimizing the scheduling of map tasks based on that?
> 
> 
> On 10/25/11 3:13 PM, "Shevek" <[email protected]> wrote:
> 
>> We pray to $deity that the MapReduce split size is about the same as (or
>> smaller than) the HDFS block size. We also pray that the file format's
>> synchronization points are frequent compared to the block size.
>> 
>> The JobClient looks up the locations of each block of each input file and
>> splits the job into FileSplits, one per block.
>> 
>> Each FileSplit is processed by one task. The split carries the hosts on
>> which the task would run with the best locality, and the scheduler uses
>> those hints when placing it.
>> 
>> The last block may be very short. If so, it is subsumed into the preceding
>> split rather than given a split of its own, roughly as in the sketch below.
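>>
>> The computation looks roughly like this (a minimal sketch using the public
>> FileSystem API, not the actual JobClient/FileInputFormat source; the 1.1x
>> slop factor mirrors FileInputFormat's SPLIT_SLOP):
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.BlockLocation;
>>   import org.apache.hadoop.fs.FileStatus;
>>   import org.apache.hadoop.fs.FileSystem;
>>   import org.apache.hadoop.fs.Path;
>>
>>   public class SplitSketch {
>>     public static void main(String[] args) throws Exception {
>>       FileSystem fs = FileSystem.get(new Configuration());
>>       FileStatus stat = fs.getFileStatus(new Path(args[0]));
>>       long splitSize = stat.getBlockSize();   // split size ~= block size
>>       long remaining = stat.getLen();
>>       long offset = 0;
>>       // One split per block; a short tail (under ~10% of a block) is
>>       // folded into the final split instead of getting one of its own.
>>       while (((double) remaining) / splitSize > 1.1) {
>>         printSplit(fs, stat, offset, splitSize);
>>         offset += splitSize;
>>         remaining -= splitSize;
>>       }
>>       if (remaining > 0)
>>         printSplit(fs, stat, offset, remaining);
>>     }
>>
>>     static void printSplit(FileSystem fs, FileStatus stat, long off, long len)
>>         throws Exception {
>>       // BlockLocation.getHosts() names the datanodes holding replicas of
>>       // this range; the scheduler prefers to run the task on one of them.
>>       for (BlockLocation loc : fs.getFileBlockLocations(stat, off, len))
>>         System.out.println("split@" + off + "+" + len + " hosts="
>>             + java.util.Arrays.toString(loc.getHosts()));
>>     }
>>   }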
>> 
>> Some data is transferred between nodes when the synchronization point for
>> the file format is not at a block boundary. (It basically never is, but we
>> hope it's close, or the purpose of MR locality is defeated.)
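>>
>> For line-oriented formats the reader handles the boundary roughly as below
>> (my simplified illustration of the LineRecordReader idea, not the Hadoop
>> source):
>>
>>   import java.io.IOException;
>>   import java.io.RandomAccessFile;
>>
>>   public class BoundarySketch {
>>     // Read the records of the split [start, end). Unless the split starts
>>     // at offset 0, the partial first line is discarded, because the
>>     // previous split's reader finishes it; symmetrically, the line that
>>     // straddles 'end' is read to completion here. That tail read is the
>>     // small inter-node transfer mentioned above.
>>     static void readSplit(RandomAccessFile in, long start, long end)
>>         throws IOException {
>>       in.seek(start);
>>       if (start != 0)
>>         in.readLine();
>>       String line;
>>       while (in.getFilePointer() <= end && (line = in.readLine()) != null)
>>         System.out.println(line);   // stand-in for the map() input
>>     }
>>   }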
>> 
>> To answer your questions directly: under the above assumptions, most of
>> the data should be read from the local HDFS node. The communication layer
>> between MapReduce and HDFS is not special.
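>>
>> That is, a map task just opens its file through the ordinary FileSystem
>> client and seeks to its split's offset; the same code runs whether the
>> block happens to be local or remote (again a sketch, not Hadoop source):
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.FSDataInputStream;
>>   import org.apache.hadoop.fs.FileSystem;
>>   import org.apache.hadoop.fs.Path;
>>
>>   public class ReadSketch {
>>     public static void main(String[] args) throws Exception {
>>       Path path = new Path(args[0]);
>>       long offset = Long.parseLong(args[1]);   // e.g. the split's start
>>       FileSystem fs = path.getFileSystem(new Configuration());
>>       FSDataInputStream in = fs.open(path);
>>       in.seek(offset);   // identical call for local and remote blocks
>>       System.out.println("first byte of split: " + in.read());
>>       in.close();
>>     }
>>   }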
>> 
>> S.
>> 
>> On 25 October 2011 11:49, <[email protected]> wrote:
>> 
>>> Hello,
>>> 
>>> I am trying to understand how data locality works in Hadoop.
>>> 
>>> If you run a MapReduce job, do the mappers only read data from the host
>>> on which they are running?
>>> 
>>> Is there a communication protocol between the MapReduce layer and the
>>> HDFS layer so that each mapper is placed where it can read its data locally?
>>> 
>>> Any pointers on which layer of the stack handles this?
>>> 
>>> Cheers,
>>> Ivan
>>> 
> 
