Well, if you think about it, you'll get better locality if more nodes have the same blocks. It gives the scheduler more leeway to find a node holding a block that hasn't been processed yet. Have you tried it with a replication factor of 2 or 3 and seen what that does?
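If it helps, here is a minimal sketch of bumping replication. As I recall from the 0.20 branch, dfs.replication in hdfs-site.xml only affects files written from then on, so data already in HDFS would need to be re-replicated with hadoop fs -setrep (the path below is just a placeholder):

    <!-- hdfs-site.xml: default replication for newly written files -->
    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>

    # re-replicate input that is already in HDFS (placeholder path),
    # recursively, and wait for replication to finish
    hadoop fs -setrep -R -w 3 /user/virajith/sort-input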
--Aaron

--------------------------------------------------------
From: Virajith Jalaparti [mailto:virajit...@gmail.com]
Sent: Tuesday, July 12, 2011 7:37 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Lack of data locality in Hadoop-0.20.2

I am using a replication factor of 1 since I don't want to incur the overhead of replication and I am not much worried about reliability. I am just using the default Hadoop scheduler (FIFO, I think!).

In the case of a single rack, rack-locality doesn't really have any meaning; obviously everything will run in the same rack. What I am concerned about is data-local maps. I assumed that Hadoop would do a much better job of ensuring data-local maps, but that doesn't seem to be the case here.

-Virajith

On Tue, Jul 12, 2011 at 3:30 PM, Arun C Murthy <a...@hortonworks.com> wrote:

Why are you running with a replication factor of 1?

Also, it depends on the scheduler you are using. The CapacityScheduler in 0.20.203 (not 0.20.2) has much better locality for jobs; similarly with the FairScheduler.

IAC, running on a single rack with replication of 1 implies rack-locality for all tasks which, in most cases, is good enough.

Arun

On Jul 12, 2011, at 5:45 AM, Virajith Jalaparti wrote:

> Hi,
>
> I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input data using a 20-node cluster. HDFS is configured to use a 128MB block size (so 1600 maps are created) and a replication factor of 1 is being used. All 20 nodes are also HDFS datanodes. I was using a bandwidth value of 50Mbps between each of the nodes (this was configured using linux "tc"). I see that around 90% of the map tasks are reading data over the network, i.e. most of the map tasks are not being scheduled on the nodes where the data they process is located.
> My understanding was that Hadoop tries to schedule as many data-local maps as possible, but in this situation that does not seem to happen. Any reason why this is happening? And is there a way to configure Hadoop to ensure the maximum possible node locality?
> Any help regarding this is very much appreciated.
>
> Thanks,
> Virajith
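Following up on the scheduler point raised above, a minimal sketch of pointing the JobTracker at a non-FIFO scheduler in mapred-site.xml. These are the property and class names I recall from the 0.20 line (the default is org.apache.hadoop.mapred.JobQueueTaskScheduler); the FairScheduler and CapacityScheduler jars ship under contrib/ and would need to be on the JobTracker classpath:

    <!-- mapred-site.xml: use the FairScheduler instead of the default FIFO scheduler -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>

    <!-- or, on 0.20.203, the CapacityScheduler -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
    </property>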