final map output not evenly distributed across multiple disks
-------------------------------------------------------------

                 Key: HADOOP-2437
                 URL: https://issues.apache.org/jira/browse/HADOOP-2437
             Project: Hadoop
          Issue Type: Bug
          Components: mapred
    Affects Versions: 0.16.0
            Reporter: Christian Kunz
            Priority: Blocker
             Fix For: 0.15.2


It seems that the final merge output of map tasks for a particular job does not 
select the output location in random fashion.

As a result, a job with many map tasks eventually runs out of taskTrackers 
asking for more tasks, because the disk holding most of the map outputs ends 
up with less free space than specified by mapred.local.dir.minspacestart.

Maybe the starting point of the round-robin selection among multiple 
locations should be randomized.
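
To make the suggestion concrete, here is a minimal sketch of a round-robin 
selector whose starting index is randomized. It is not Hadoop's actual 
allocation code: the class name RoundRobinDirPicker and its methods are 
hypothetical, and it assumes each task creates its own selector, so a fixed 
starting index would send every task's first write to the same location.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch only; not the allocator Hadoop actually uses. */
class RoundRobinDirPicker {
    private final List<String> dirs;
    private int next;

    RoundRobinDirPicker(List<String> dirs, Random rng) {
        if (dirs.isEmpty()) {
            throw new IllegalArgumentException("need at least one directory");
        }
        this.dirs = dirs;
        // Randomize only the starting point; later picks stay round-robin.
        this.next = rng.nextInt(dirs.size());
    }

    /** Returns the current directory and advances to the next one. */
    synchronized String pick() {
        String dir = dirs.get(next);
        next = (next + 1) % dirs.size();
        return dir;
    }

    public static void main(String[] args) {
        List<String> dirs =
            Arrays.asList("location1", "location2", "location3", "location4");
        // Simulate several tasks, each with a fresh selector (an assumption).
        for (int task = 0; task < 8; task++) {
            RoundRobinDirPicker picker =
                new RoundRobinDirPicker(dirs, new Random());
            System.out.println("task " + task + " -> " + picker.pick());
        }
    }
}
```

With a random start, the first pick of each task is spread across all 
locations, while each selector still cycles through every location in order.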

In our case:
110,000 maps, each with about 3 GB of final output, on a 1300-node cluster.
With 4 locations, and after about 79,000 maps had been processed, the 
selection of locations for the final map outputs ('file.out') looked like:
location1: 24,000
location2: 25
location3: 55,000
location4: 7
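
To put the skew in proportion: the four counts above sum to 79,032, so an even 
split would be about 19,758 per location, whereas location3 holds roughly 70% 
and location1 roughly 30% of the outputs. The short sketch below computes 
these shares; the class and method names are illustrative only.

```java
/** Illustrative helper; names are not taken from Hadoop. */
class MapOutputSkew {
    /** Returns each location's fraction of the total (total must be > 0). */
    static double[] shares(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        double[] result = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            result[i] = (double) counts[i] / total;
        }
        return result;
    }

    public static void main(String[] args) {
        // Counts reported above for location1..location4.
        long[] counts = {24_000, 25, 55_000, 7};
        double[] s = shares(counts);
        for (int i = 0; i < s.length; i++) {
            System.out.printf("location%d: %6d outputs, %.2f%% of total%n",
                i + 1, counts[i], 100.0 * s[i]);
        }
    }
}
```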



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.