[ https://issues.apache.org/jira/browse/MAPREDUCE-1220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Roelofs updated MAPREDUCE-1220:
------------------------------------

    Attachment: MR-1220.v2.trunk-hadoop-mapreduce.patch.txt

Updated version of Arun's prototype patch; it compiles cleanly but has not been tested beyond that.

> Implement an in-cluster LocalJobRunner
> --------------------------------------
>
>                 Key: MAPREDUCE-1220
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1220
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>          Components: client, jobtracker
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.22.0
>
>         Attachments: MAPREDUCE-1220_yhadoop20.patch,
>                      MR-1220.v2.trunk-hadoop-mapreduce.patch.txt
>
>
> Currently, very small map-reduce jobs suffer from latency issues due to
> overheads in Hadoop Map-Reduce such as scheduling, JVM startup, etc. We've
> periodically tried to optimize all parts of the framework to achieve lower
> latencies.
> I'd like to turn the problem around a little bit. I propose we allow very
> small jobs to run as a single-task job with multiple maps and reduces, i.e.
> similar to our current implementation of the LocalJobRunner. Thus, under
> certain conditions (maybe user-set configuration, or if the input data is
> small, i.e. less than a DFS blocksize) we could launch a special task which
> will run all maps in a serial manner, followed by the reduces. This would
> really help small jobs achieve significantly smaller latencies, thanks to
> less scheduling overhead, less JVM startup, no shuffle over the network, etc.
> This would be a huge benefit, especially on large clusters, to small Hive/Pig
> queries.
> Thoughts?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
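
To make the proposal in the issue description concrete, here is a minimal Java sketch of the two ideas it describes: deciding whether a job is small enough to run as a single task, and running all maps serially followed by the reduces. It is an illustration under stated assumptions, not code from either attached patch. The class name SmallJobPolicy, both method names, and all parameters are hypothetical, and the exact eligibility conditions are left open in the description ("maybe user-set configuration, or if input data is small").

```java
import java.util.List;

// Hypothetical illustration only; none of these names come from the patch.
final class SmallJobPolicy {
    private SmallJobPolicy() {}

    // Assumed eligibility rule: the job runs as a single task if the user
    // explicitly asked for it, or if the total input is smaller than one
    // DFS block.
    static boolean shouldRunInSingleTask(boolean userRequested,
                                         long inputBytes,
                                         long dfsBlockSizeBytes) {
        return userRequested || inputBytes < dfsBlockSizeBytes;
    }

    // Runs every map step in order, then every reduce step, all on the
    // calling thread. The calling thread stands in for the one special task,
    // so there is no per-task scheduling, no extra JVM startup, and no
    // network shuffle between the two phases.
    static void runSerially(List<Runnable> maps, List<Runnable> reduces) {
        for (Runnable map : maps) {
            map.run();
        }
        for (Runnable reduce : reduces) {
            reduce.run();
        }
    }
}
```

A caller would first check shouldRunInSingleTask and fall back to normal per-task scheduling when it returns false; how the intermediate map output is handed to the reduces inside the single task is not covered by this sketch.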