Jobs should not submit the same jar files over and over again
-------------------------------------------------------------
Key: MAPREDUCE-1901
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1901
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Joydeep Sen Sarma
Currently, each Hadoop job uploads its required resources (jars/files/archives)
to a new location in HDFS. The map-reduce nodes involved in executing the job
then download these resources to local disk.
In an environment where most users rely on a standard set of jars and files
(because they are using a framework like Hive/Pig), the same jars keep getting
uploaded and downloaded repeatedly. The overhead of this protocol (primarily in
terms of end-user latency) is significant when:
- the jobs are small (and, correspondingly, large in number)
- the NameNode is under load (meaning HDFS latencies are high and made worse,
in part, by this protocol)
Hadoop should provide a way for jobs in a cooperative environment to avoid
submitting the same files over and over again. Identifying and caching
execution resources by a content signature (MD5/SHA) would be a good option to
have available.
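
As a rough sketch of the idea: a submit-side helper could hash each resource
and upload it only if no copy with the same signature already exists in a
shared, world-readable cache directory. The class name, the /jobcache path,
and the choice of SHA-256 below are illustrative assumptions, not an existing
Hadoop API.

    // Hypothetical submit-side helper (illustrative only; not an existing
    // Hadoop API). Assumes a shared HDFS cache directory, /jobcache, that
    // all cooperating users can read.
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SharedResourceCache {

      // Compute a hex SHA-256 content signature of a local jar/file/archive.
      static String signature(String localFile)
          throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new FileInputStream(localFile)) {
          byte[] buf = new byte[8192];
          for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            md.update(buf, 0, n);
          }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
          hex.append(String.format("%02x", b));
        }
        return hex.toString();
      }

      // Upload the resource only if no copy with the same signature is
      // cached yet; return the shared path for the job to reference.
      static Path cacheOrReuse(FileSystem fs, String localFile)
          throws IOException, NoSuchAlgorithmException {
        Path cached = new Path("/jobcache/" + signature(localFile));
        if (!fs.exists(cached)) {
          fs.copyFromLocalFile(new Path(localFile), cached);
        }
        return cached;
      }
    }

A job submitter would call cacheOrReuse once per resource and reference the
returned shared path instead of a per-job copy; task nodes could likewise key
their local download cache on the same signature, so a given jar is fetched at
most once per node. A real implementation would also need to handle concurrent
uploads and cache eviction.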