I am using a replication factor of 1 since I don't want to incur the overhead
of replication and I am not much worried about reliability.

I am just using the default Hadoop scheduler (FIFO, I think!). With a single
rack, rack-locality doesn't really have any meaning: everything will
obviously run in the same rack. What I am concerned about is data-local maps.
I assumed that Hadoop would do a much better job of ensuring data-local maps,
but that doesn't seem to be the case here.
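
For reference, here is a rough back-of-the-envelope sketch of the numbers from
my original mail (quoted below). It assumes 1 GB = 1024 MB, that blocks are
spread evenly across the datanodes, and that a remote read gets the full
50 Mbps link to itself. The function and variable names are just illustrative.

```python
# Rough arithmetic for the setup described in this thread.
# Assumptions (not measured): 1 GB = 1024 MB, blocks evenly spread over
# the datanodes, and a remote read gets the whole 50 Mbps link.

INPUT_GB = 200          # Sort input size
BLOCK_MB = 128          # HDFS block size
NODES = 20              # datanodes / tasktrackers
LINK_MBPS = 50          # per-link bandwidth set with tc
REMOTE_FRACTION = 0.90  # observed share of non-data-local maps


def num_map_tasks(input_gb, block_mb):
    """One map task per HDFS block."""
    return (input_gb * 1024) // block_mb


def block_transfer_seconds(block_mb, link_mbps):
    """Time to pull one full block over the network at the given rate."""
    bits = block_mb * 1024 * 1024 * 8
    return bits / (link_mbps * 1_000_000)


maps = num_map_tasks(INPUT_GB, BLOCK_MB)
blocks_per_node = maps // NODES
remote_maps = round(maps * REMOTE_FRACTION)
seconds_per_remote_block = block_transfer_seconds(BLOCK_MB, LINK_MBPS)

print("map tasks:", maps)
print("blocks per node (if evenly spread):", blocks_per_node)
print("maps reading over the network:", remote_maps)
print("seconds to fetch one block remotely: %.1f" % seconds_per_remote_block)
```

So each non-local map spends on the order of 20 seconds just fetching its
input block, which is why data-locality matters so much in this setup.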

-Virajith

On Tue, Jul 12, 2011 at 3:30 PM, Arun C Murthy <a...@hortonworks.com> wrote:

> Why are you running with replication factor of 1?
>
> Also, it depends on the scheduler you are using. The CapacityScheduler in
> 0.20.203 (not 0.20.2) has much better locality for jobs, similarly with
> FairScheduler.
>
> IAC, running on a single rack with replication of 1 implies rack-locality
> for all tasks which, in most cases, is good enough.
>
> Arun
>
> On Jul 12, 2011, at 5:45 AM, Virajith Jalaparti wrote:
>
> > Hi,
> >
> > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of
> > input data using a 20-node cluster. HDFS is configured to use a 128MB
> > block size (so 1600 maps are created) and a replication factor of 1 is
> > being used. All 20 nodes are also HDFS datanodes. I was using a
> > bandwidth value of 50Mbps between each pair of nodes (this was
> > configured using Linux "tc"). I see that around 90% of the map tasks
> > are reading data over the network, i.e. most of the map tasks are not
> > being scheduled on the nodes where the data they process is located.
> > My understanding was that Hadoop tries to schedule as many data-local
> > maps as possible, but that does not seem to happen in this situation.
> > Is there a reason for this? And is there a way to configure Hadoop to
> > ensure the maximum possible node locality?
> > Any help regarding this is very much appreciated.
> >
> > Thanks,
> > Virajith
>
>
