[ https://issues.apache.org/jira/browse/MAPREDUCE-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arun C Murthy reopened MAPREDUCE-1220:
--------------------------------------

> Implement an in-cluster LocalJobRunner
> --------------------------------------
>
>                 Key: MAPREDUCE-1220
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: client, jobtracker
>            Reporter: Arun C Murthy
>            Assignee: Greg Roelofs
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-1220_yhadoop20.patch,
> MR-1220.v1.trunk-hadoop-common.Progress-dumper.patch.txt,
> MR-1220.v10e-v11c-v12b.ytrunk-hadoop-mapreduce.delta.patch.txt,
> MR-1220.v13.ytrunk-hadoop-mapreduce.delta.patch.txt,
> MR-1220.v14b.ytrunk-hadoop-mapreduce.delta.patch.txt,
> MR-1220.v15.ytrunk-hadoop-mapreduce.delta.patch.txt,
> MR-1220.v1b.sshot-02-jobdetails.jsp.png,
> MR-1220.v1b.sshot-03-jobdetailshistory.jsp.png,
> MR-1220.v2.trunk-hadoop-mapreduce.patch.txt,
> MR-1220.v2.trunk-hadoop-mapreduce.patch.txt,
> MR-1220.v2b.sshot-01-jobtracker.jsp.png,
> MR-1220.v6.ytrunk-hadoop-mapreduce.patch.txt,
> MR-1220.v7.ytrunk-hadoop-mapreduce.delta.patch.txt,
> MR-1220.v8b.ytrunk-hadoop-mapreduce.delta.patch.txt,
> MR-1220.v9c.ytrunk-hadoop-mapreduce.delta.patch.txt
>
>
> Currently, very small map-reduce jobs suffer from latency issues due to
> overheads in Hadoop Map-Reduce such as scheduling, JVM startup, etc. We've
> periodically tried to optimize all parts of the framework to achieve lower
> latencies.
> I'd like to turn the problem around a little bit. I propose we allow very
> small jobs to run as a single-task job with multiple maps and reduces, i.e.
> similar to our current implementation of the LocalJobRunner. Thus, under
> certain conditions (maybe user-set configuration, or if input data is small,
> i.e. less than a DFS block size) we could launch a special task which will
> run all maps in a serial manner, followed by the reduces.
> This would really help small jobs achieve significantly smaller latencies,
> thanks to less scheduling overhead and JVM startup, no shuffle over the
> network, etc.
> This would be a huge benefit, especially on large clusters, to small Hive/Pig
> queries.
> Thoughts?
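
The proposal above can be illustrated with a minimal, framework-free sketch. It is not taken from any of the attached patches and uses no Hadoop APIs: the class name, the method names, the eligibility rule (user opt-in, or input smaller than one DFS block) and the word-count map/reduce functions are all assumptions made for illustration. It shows one task running every map serially, grouping the intermediate pairs in memory instead of shuffling them over the network, and then running the reduces.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only: none of these names come from Hadoop or from the patches.
public class SingleTaskJobSketch {

    // Assumed eligibility rule: explicit user opt-in, or input smaller than one DFS block.
    public static boolean shouldRunInSingleTask(boolean userOptIn, long inputBytes, long dfsBlockSizeBytes) {
        return userOptIn || inputBytes < dfsBlockSizeBytes;
    }

    // Example map function (word count): emit (word, 1) for each whitespace-separated token.
    public static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Example reduce function (word count): sum the values collected for one key.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // The "special task": all maps run serially in this one JVM, followed by the reduces.
    public static Map<String, Integer> runInSingleTask(List<String> splits) {
        // In-memory grouping by key stands in for the network shuffle.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String split : splits) {
            for (Map.Entry<String, Integer> kv : map(split)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        // Reduces run only after every map has finished.
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("a b a", "b c");
        long inputBytes = 8L;                    // illustrative input size
        long blockSize = 64L * 1024 * 1024;      // illustrative DFS block size
        if (shouldRunInSingleTask(false, inputBytes, blockSize)) {
            System.out.println(runInSingleTask(splits)); // prints {a=2, b=2, c=1}
        }
    }
}
```

In this sketch the saving comes from paying for scheduling and JVM startup once per job rather than once per task, at the cost of giving up parallelism, which is why the eligibility check is restricted to small inputs or an explicit opt-in.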