Re: Lack of data locality in Hadoop-0.20.2

2011-07-13 Thread Virajith Jalaparti
Hi Matei, Using the fair scheduler of the cloudera distribution seems to have (mostly) solved the problem. Thanks a lot for the suggestion. -Virajith On Tue, Jul 12, 2011 at 7:23 PM, Matei Zaharia wrote: > Hi Virajith, > > The default FIFO scheduler just isn't optimized for locality for small >

RE: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Aaron Baff
s not having to read all that much extra data. --Aaron - From: Virajith Jalaparti [mailto:virajit...@gmail.com] Sent: Tuesday, July 12, 2011 3:21 PM To: mapreduce-user@hadoop.apache.org Subject: Re: Lack of data locality in Hadoop-0

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
Could the non-data-local nature of the maps be due to the amount of HDFS data read by each map being greater than the HDFS block size? In the job I was running, the HDFS block size (dfs.block.size) was 134217728, the HDFS_BYTES_READ by the maps was 134678218, and FILE_BYTES_READ was 134698338.
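
The figures in this message can be compared directly. The sketch below only quantifies how far the reported HDFS_BYTES_READ exceeds one block; it does not by itself show whether that excess affects locality. The function name `overshoot` is hypothetical and introduced purely for illustration.

```python
# Values quoted in the message above.
BLOCK_SIZE = 134217728        # dfs.block.size in bytes (128 * 1024 * 1024)
HDFS_BYTES_READ = 134678218   # per-map counter value quoted in the message


def overshoot(bytes_read, block_size):
    """Return (extra bytes beyond one block, that excess as a fraction of a block)."""
    extra = bytes_read - block_size
    return extra, extra / block_size


extra, fraction = overshoot(HDFS_BYTES_READ, BLOCK_SIZE)
print(f"{extra} bytes beyond one block ({fraction:.2%} of a block)")
```

Under these numbers the excess is 460490 bytes, well under 1% of a block, which is in line with the remark elsewhere in the thread that a little edge record data is always read over the network.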

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
On 7/12/2011 7:20 PM, Allen Wittenauer wrote: On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote: I agree that the scheduler has lesser leeway when the replication factor is 1. However, I would still expect the number of data-local tasks to be more than 10% even when the replication factor

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Matei Zaharia
Hi Virajith, The default FIFO scheduler just isn't optimized for locality for small jobs. You should be able to get substantially more locality even with 1 replica if you use the fair scheduler, although the version of the scheduler in 0.20 doesn't contain the locality optimization. Try the Clo

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Allen Wittenauer
On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote: > I agree that the scheduler has lesser leeway when the replication factor is > 1. However, I would still expect the number of data-local tasks to be more > than 10% even when the replication factor is 1. How did you load your data?

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
I agree that the scheduler has lesser leeway when the replication factor is 1. However, I would still expect the number of data-local tasks to be more than 10% even when the replication factor is 1. Presumably, the scheduler would have greater number of opportunities to schedule data-local tasks as

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Arun C Murthy
As Aaron mentioned, the scheduler has very little leeway when you have a single replica. OTOH, schedulers equate rack-locality to node-locality - this makes sense for a large-scale system since intra-rack b/w is good enough for most installs of Hadoop. Arun On Jul 12, 2011, at 7:36 AM, V

RE: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Aaron Baff
? --Aaron From: Virajith Jalaparti [mailto:virajit...@gmail.com] Sent: Tuesday, July 12, 2011 7:37 AM To: mapreduce-user@hadoop.apache.org Subject: Re: Lack of data locality in Hadoop-0.20.2 I am using a replication factor of 1 since I don't want to incur the o

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
I am attaching the config files I was using for these runs with this email. I am not sure if something in them is causing this non-data locality of Hadoop. Thanks, Virajith On Tue, Jul 12, 2011 at 3:36 PM, Virajith Jalaparti wrote: > I am using a replication factor of 1 since I don't want to incur the

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
I am using a replication factor of 1 since I don't want to incur the overhead of replication and I am not much worried about reliability. I am just using the default Hadoop scheduler (FIFO, I think!). In case of a single rack, rack-locality doesn't really have any meaning. Obviously everything will run

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Arun C Murthy
Why are you running with replication factor of 1? Also, it depends on the scheduler you are using. The CapacityScheduler in 0.20.203 (not 0.20.2) has much better locality for jobs, similarly with FairScheduler. IAC, running on a single rack with replication of 1 implies rack-locality for all t

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
Harsh, I am assuming you mean the web-interface of the jobtracker, right? What I see there is appended at the end of the email. Is there supposed to be a counter which is equal to the number of data-local jobs? One obvious way to find this would be to look at the location of the input split of eac

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Harsh J
Virajith, You can see the counters for data-local vs. non-data-local tasks in the job itself. On Tue, Jul 12, 2011 at 6:36 PM, Virajith Jalaparti wrote: > How do I find the number of data-local map tasks that are launched? I > checked the log files but didn't see any information about this. All the map >

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
Each node is configured to run 8 map tasks. I am using 2.4 GHz 64-bit Quad Core Xeon machines. -Virajith On Tue, Jul 12, 2011 at 2:05 PM, Sudharsan Sampath wrote: > what's the map task capacity of each node ? > > On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti > wrote: > >> Hi, >> >> I

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
How do I find the number of data-local map tasks that are launched? I checked the log files but didn't see any information about this. All the map tasks are rack local since I am running the job just using a single rack. From the completion time per map (comparing it to the case where I have 1Gbps

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Sudharsan Sampath
what's the map task capacity of each node ? On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti wrote: > Hi, > > I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input > data using a 20-node cluster. HDFS is configured to use 128MB block > size (so 1600 maps are created

Re: Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Harsh J
How much bandwidth did you see being utilized? How many tasks were launched as data-local map tasks versus rack-local ones? A little bit of edge record data is always read over network but that is highly insignificant compared to the amount of data read locally (a whole block

Lack of data locality in Hadoop-0.20.2

2011-07-12 Thread Virajith Jalaparti
Hi, I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input data using a 20-node cluster. HDFS is configured to use 128MB block size (so 1600 maps are created) and a replication factor of 1 is being used. All the 20 nodes are also hdfs datanodes. I was using a bandwidth v
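
The map count quoted in this message follows from the input size and block size. A minimal sketch of that arithmetic, assuming binary units (1 GB = 1024 MB), one map per HDFS block, and blocks spread evenly across the datanodes; none of these assumptions is confirmed in the thread, and the helper names are hypothetical. The figure of 8 map slots per node comes from a later reply in the thread.

```python
MB = 1024 * 1024
GB = 1024 * MB


def num_map_tasks(input_bytes, block_bytes):
    """Number of maps, assuming one map per HDFS block (ceiling division)."""
    return -(-input_bytes // block_bytes)


def map_waves(total_maps, nodes, slots_per_node):
    """Rounds of maps needed if every slot in the cluster is busy in each round."""
    total_slots = nodes * slots_per_node
    return -(-total_maps // total_slots)


maps = num_map_tasks(200 * GB, 128 * MB)
blocks_per_node = maps // 20        # assumes an even spread over 20 datanodes
waves = map_waves(maps, nodes=20, slots_per_node=8)
print(maps, blocks_per_node, waves)
```

Under these assumptions the job has 1600 maps, about 80 blocks stored per node, and roughly 10 waves of maps across 160 slots.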