Hi Josh,

Copying a large number of small map outputs can take a while. I can't say why the tasktracker is not running more than one mapper at a time.
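As a rough back-of-envelope check, here is the arithmetic using the figures from the quoted message below. It is only a sketch: it assumes that nearly all of the 30-minute reduce is spent in the copy phase and that the total map output is no larger than the 7 MB input, neither of which the message confirms.

```python
# Back-of-envelope estimate. Figures come from the quoted message,
# except where marked as an assumption.
map_outputs = 344        # from the status line "reduce > copy (225 of 344 ...)"
reduce_minutes = 30      # reported duration of the reduce
total_input_mb = 7       # reported total table size

# Assumption: nearly all of the reduce time is spent in the copy phase.
seconds_per_fetch = reduce_minutes * 60 / map_outputs

# Assumption: total map output is no larger than the 7 MB input,
# so this is an upper bound on the average size of one map output.
avg_output_kb = total_input_mb * 1024 / map_outputs

print(f"about {seconds_per_fetch:.1f} s per map output fetched")
print(f"at most about {avg_output_kb:.0f} KB per map output")
```

Under those assumptions each fetch takes several seconds to move a few kilobytes, which suggests that per-fetch overhead rather than network bandwidth dominates the copy, and that fewer, larger maps should help.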
We are working on the small-files problem. HADOOP-4565 is the JIRA for creating splits that span multiple files while preserving locality. HIVE-74 will use HADOOP-4565 on the Hive side to control the number of maps better.

Joydeep

________________________________
From: Josh Ferguson [mailto:j...@besquared.net]
Sent: Monday, January 26, 2009 11:28 PM
To: hive-user@hadoop.apache.org
Subject: Job Speed

So I have a table with roughly 145,000 records spread across 300 files. The total size is about 7MB. Right now I'm running one job tracker and one task tracker which is a high cpu amazon box (1.7 Gbits of RAM, ~ 4 cores). I run the following query:

SELECT COUNT(DISTINCT(activities.actor_id)) FROM activities;

And it takes about 35 minutes to finish. One of my problems is that I can't get my task tracker to process more than one map at a time even though it has a higher number of maximum map tasks. But even that is relatively fast compared to the reduce which takes about 30 minutes by itself. The status of the task is:

reduce > copy (225 of 344 at 0.01 MB/s) >

I really don't understand what is going on during this copy step or why it is taking so long. The files are small and they're all inside of amazon's network. Can you guys help me out?

Josh F.