Hi Matei,
Using the fair scheduler from the Cloudera distribution seems to have (mostly)
solved the problem. Thanks a lot for the suggestion.
-Virajith
On Tue, Jul 12, 2011 at 7:23 PM, Matei Zaharia wrote:
> Hi Virajith,
>
> The default FIFO scheduler just isn't optimized for locality for small
> jobs.
…s not having to read
all that much extra data.
--Aaron
From: Virajith Jalaparti [mailto:virajit...@gmail.com]
Sent: Tuesday, July 12, 2011 3:21 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Lack of data locality in Hadoop-0.20.2
Is the non-data-local nature of the maps possibly due to the amount of HDFS
data read by each map being greater than the HDFS block size? In the job I
was running, the HDFS block size dfs.block.size was 134217728 and the
HDFS_BYTES_READ by the maps was 134678218 and FILE_BYTES_READ was 134698338.
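For reference, the overage in those counters is tiny relative to the block size. A quick arithmetic check (plain Python, values copied from the message above):

```python
# Back-of-envelope check with the counter values from the message above:
# the HDFS bytes read per map only slightly exceed the configured block
# size, consistent with maps reading a little edge-record data past the
# block boundary.
block_size = 134217728        # dfs.block.size (128 MB)
hdfs_bytes_read = 134678218   # HDFS_BYTES_READ per map

overage = hdfs_bytes_read - block_size
print(overage)                               # 460490 bytes (~450 KB)
print(round(100 * overage / block_size, 2))  # 0.34 (% extra)
```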
On 7/12/2011 7:20 PM, Allen Wittenauer wrote:
On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote:
I agree that the scheduler has less leeway when the replication factor is
1. However, I would still expect the number of data-local tasks to be more
than 10% even when the replication factor is 1.
Hi Virajith,
The default FIFO scheduler just isn't optimized for locality for small jobs.
You should be able to get substantially more locality even with 1 replica if
you use the fair scheduler, although the version of the scheduler in 0.20
doesn't contain the locality optimization. Try the Cloudera distribution.
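Switching schedulers is a JobTracker config change. A minimal sketch for mapred-site.xml, assuming the 0.20-era contrib FairScheduler class name and property key; verify both against your distribution's documentation, and note the fairscheduler contrib jar must be on the JobTracker classpath:

```xml
<!-- mapred-site.xml: replace the default FIFO scheduler with the fair
     scheduler. Property and class names as in 0.20-era Hadoop; check
     your distribution before relying on them. -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```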
On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote:
> I agree that the scheduler has less leeway when the replication factor is
> 1. However, I would still expect the number of data-local tasks to be more
> than 10% even when the replication factor is 1.
How did you load your data?
I agree that the scheduler has less leeway when the replication factor is
1. However, I would still expect the number of data-local tasks to be more
than 10% even when the replication factor is 1. Presumably, the scheduler
would have a greater number of opportunities to schedule data-local tasks as
As Aaron mentioned, the scheduler has very little leeway when you have a single
replica.
OTOH, schedulers equate rack-locality to node-locality - this makes sense
for a large-scale system since intra-rack b/w is good enough for most installs
of Hadoop.
Arun
On Jul 12, 2011, at 7:36 AM, Virajith Jalaparti wrote:
--Aaron
From: Virajith Jalaparti [mailto:virajit...@gmail.com]
Sent: Tuesday, July 12, 2011 7:37 AM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Lack of data locality in Hadoop-0.20.2
I am using a replication factor of 1 since I don't want to incur the overhead of
I am attaching the config files I was using for these runs with this email.
I am not sure if something in them is causing this lack of data locality in
Hadoop.
Thanks,
Virajith
On Tue, Jul 12, 2011 at 3:36 PM, Virajith Jalaparti wrote:
> I am using a replication factor of 1 since I don't want to incur the
I am using a replication factor of 1 since I don't want to incur the overhead of
replication and I am not much worried about reliability.
I am just using the default Hadoop scheduler (FIFO, I think!). In case of a
single rack, rack-locality doesn't really have any meaning. Obviously
everything will run rack-local.
Why are you running with replication factor of 1?
Also, it depends on the scheduler you are using. The CapacityScheduler in
0.20.203 (not 0.20.2) has much better locality for jobs, similarly with
FairScheduler.
IAC, running on a single rack with replication of 1 implies rack-locality for
all tasks.
Harsh,
I am assuming you mean the web-interface of the jobtracker, right? What I
see there is appended at the end of the email. Is there supposed to be a
counter which is equal to the number of data-local maps? One obvious way to
find this would be to look at the location of the input split of each map task.
Virajith,
You can see the counters for data-local vs. non-data-local maps in the job itself.
On Tue, Jul 12, 2011 at 6:36 PM, Virajith Jalaparti
wrote:
> How do I find the number of data-local map tasks that are launched? I
> checked the log files but didn't see any information about this. All the map
> tasks are rack local since I am running the job just using a single rack.
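The counters Harsh refers to appear in the "Job Counters" group on the 0.20 JobTracker's job page ("Data-local map tasks" / "Rack-local map tasks"). A sketch of the locality check, with hypothetical counter values chosen to match the roughly 10% data-locality reported in this thread:

```python
# Hypothetical counter values for illustration only; read the real ones
# off the JobTracker's job page ("Job Counters" group).
counters = {
    "TOTAL_LAUNCHED_MAPS": 1600,
    "DATA_LOCAL_MAPS": 160,   # hypothetical: ~10%, as observed in the thread
    "RACK_LOCAL_MAPS": 1600,  # single rack: every map is rack-local
}

data_local_pct = 100 * counters["DATA_LOCAL_MAPS"] / counters["TOTAL_LAUNCHED_MAPS"]
print(data_local_pct)  # 10.0
```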
Each node is configured to run 8 map tasks. I am using machines with
2.4 GHz 64-bit Quad Core Xeons.
-Virajith
On Tue, Jul 12, 2011 at 2:05 PM, Sudharsan Sampath wrote:
> what's the map task capacity of each node ?
>
> On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti
> wrote:
>
>> Hi,
>>
>> I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input
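The slot math implied by the numbers above can be sketched as plain arithmetic (no Hadoop API assumed):

```python
# Back-of-envelope capacity math from the thread's figures: 20 nodes with
# 8 map slots each run 160 maps at a time, so the 1600-map job completes
# in roughly 10 waves.
nodes = 20
map_slots_per_node = 8
total_maps = 1600

concurrent_maps = nodes * map_slots_per_node
waves = total_maps // concurrent_maps
print(concurrent_maps, waves)  # 160 10
```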
How do I find the number of data-local map tasks that are launched? I
checked the log files but didn't see any information about this. All the map
tasks are rack local since I am running the job just using a single rack.
From the completion time per map (comparing it to the case where I have
1Gbps
what's the map task capacity of each node ?
On Tue, Jul 12, 2011 at 6:15 PM, Virajith Jalaparti wrote:
> Hi,
>
> I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input
> data using a 20-node cluster. HDFS is configured to use a 128MB block
> size (so 1600 maps are created)
How much bandwidth did you see being utilized? What was the count of
tasks launched as data-local maps versus rack-local ones?
A little bit of edge-record data is always read over the network but that
is highly insignificant compared to the amount of data read locally (a
whole block).
Hi,
I was trying to run the Sort example in Hadoop-0.20.2 over 200GB of input
data using a 20-node cluster. HDFS is configured to use a 128MB block
size (so 1600 maps are created) and a replication factor of 1 is being used.
All 20 nodes are also HDFS datanodes. I was using a bandwidth v
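The map count in the setup above follows directly from the input and block sizes; a quick sanity check:

```python
# Sanity check of the job described above: 200 GB of input at a 128 MB
# HDFS block size gives 1600 input splits, hence the 1600 map tasks.
input_bytes = 200 * 1024**3
block_size = 128 * 1024**2

num_maps = input_bytes // block_size
print(num_maps)  # 1600
```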