Hi Josh,

Copying a large number of small map outputs can take a while. I can't say why
the tasktracker is not running more than one mapper at a time.

We are working on this. HADOOP-4565 is the JIRA that tracks creating splits
that span multiple files while preserving locality. HIVE-74 will use
HADOOP-4565 on the Hive side to better control the number of maps.
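
To illustrate the idea only: the sketch below is not the HADOOP-4565 code. It
is a simplified greedy packer that ignores locality entirely, and the function
name, the file names, and the 64 MB target are all assumptions made up for the
example. It shows how many small files could collapse into very few splits,
and therefore very few map tasks and map outputs to copy.

```python
def pack_into_splits(file_sizes, max_split_bytes):
    """Greedily group (name, size) pairs into splits of at most max_split_bytes.

    Hypothetical helper for illustration; real split generation would also
    consider which hosts hold each block (locality), which is skipped here.
    """
    splits = []
    current, current_bytes = [], 0
    for name, size in file_sizes:
        # Start a new split if adding this file would overflow the current one.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


# Roughly the table from this thread: 300 files, about 7 MB in total.
# The file names are invented for the example.
files = [("part-%05d" % i, 7 * 1024 * 1024 // 300) for i in range(300)]
splits = pack_into_splits(files, 64 * 1024 * 1024)  # assumed 64 MB target
print(len(files), "files ->", len(splits), "split(s)")
```

With one split per file, the reducer has to fetch hundreds of tiny map outputs;
with packed splits it would fetch only a handful.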

Joydeep

________________________________
From: Josh Ferguson [mailto:j...@besquared.net]
Sent: Monday, January 26, 2009 11:28 PM
To: hive-user@hadoop.apache.org
Subject: Job Speed

So I have a table with roughly 145,000 records spread across 300 files. The
total size is about 7 MB. Right now I'm running one job tracker and one task
tracker, which is a high-CPU Amazon box (1.7 GB of RAM, ~4 cores). I run the
following query:

SELECT COUNT(DISTINCT(activities.actor_id)) FROM activities;

And it takes about 35 minutes to finish. One of my problems is that I can't get
my task tracker to process more than one map at a time, even though its maximum
number of map tasks is set higher. But even the map phase is relatively fast
compared to the reduce, which takes about 30 minutes by itself. The status of
the reduce task is:

reduce > copy (225 of 344 at 0.01 MB/s) >


I really don't understand what is going on during this copy step or why it is
taking so long. The files are small and they're all inside Amazon's network.
Can you guys help me out?


Josh F.
