[ https://issues.apache.org/jira/browse/MAPREDUCE-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13145669#comment-13145669 ]
Robert Joseph Evans commented on MAPREDUCE-3323:
------------------------------------------------

I have read through all of your patches and I have a few comments.
# I don't really like the name current.task.type.internal. It would be better to prefix it with mapreduce.
# I think it is slightly faster to change {code}fileURI.toArray(new URI[0]){code} to {code}fileURI.toArray(new URI[fileURI.size()]){code}, but this is just a nit.
# There are no tests in the patches. I know you have done some manual testing, but adding/updating the unit tests is important for this to be accepted.
# Have you tested add(Archive|File)ToClassPathFor(Map|Reduce)? They set "mapred.job.classpath.(archives|files)", so if you use these methods some of the entries in "mapred.job.classpath.(archives|files)" will not be valid.
# Why are you setting CACHE_(FILE|ARCHIVE)_FOR_(MAP|REDUCE)? It seems like you could just go off of the existence of CACHE_(ARCHIVES|FILES)_(MAP|REDUCE).
# Could you please add the new user-facing configuration keys to mapred-default.xml so that they are documented?

> Add a new interface to the Distributed Cache that is specific to Map or Reduce, but
> not both.
> ---------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3323
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3323
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: distributed-cache, tasktracker
>    Affects Versions: 0.20.203.0
>            Reporter: Azuryy(Chijiong)
>             Fix For: 0.20.203.0
>
>         Attachments: DistributedCache.patch, GenericOptionsParser.patch,
> JobClient.patch, TaskDistributedCacheManager.patch, TaskTracker.patch
>
>
> We put some files into the Distributed Cache, but sometimes only the Map or the
> Reduce side uses these cached files, not both. The TaskTracker, however, always
> downloads cached files from HDFS, which is time-expensive when the cache contains
> somewhat large files.
> So this patch adds some new APIs to DistributedCache.java, as follows:
> addArchiveToClassPathForMap
> addArchiveToClassPathForReduce
> addFileToClassPathForMap
> addFileToClassPathForReduce
> addCacheFileForMap
> addCacheFileForReduce
> addCacheArchiveForMap
> addCacheArchiveForReduce
> The new APIs don't affect the original interface. Users can use these features in
> either of the following two ways:
> 1)
> hadoop job **** -files file1 -mapfiles file2 -reducefiles file3 -archives arc1
> -maparchives arc2 -reducearchives arc3
> 2)
> DistributedCache.addCacheFile(conf, file1);
> DistributedCache.addCacheFileForMap(conf, file2);
> DistributedCache.addCacheFileForReduce(conf, file3);
> DistributedCache.addCacheArchives(conf, arc1);
> DistributedCache.addCacheArchivesForMap(conf, arc2);
> DistributedCache.addCacheArchivesForReduce(conf, arc3);
> These two ways have the same result. That means you put six files into the
> distributed cache, file1 ~ file3 and arc1 ~ arc3, where:
> file1 and arc1 are cached for both map and reduce;
> file2 and arc2 are cached only for map;
> file3 and arc3 are cached only for reduce.
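The toArray nit in the review comments above comes down to whether the caller pre-sizes the destination array. A minimal, self-contained sketch of the two forms (the list contents are made up for illustration, and whether the pre-sized form is actually faster depends on the JVM, which is why the reviewer calls it a nit):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ToArrayNit {
    public static void main(String[] args) throws Exception {
        List<URI> fileURI = new ArrayList<>(Arrays.asList(
                new URI("hdfs:///cache/a.jar"),
                new URI("hdfs:///cache/b.jar")));

        // Zero-length form: the passed-in array is too small, so toArray
        // must reflectively allocate a new, correctly sized URI[].
        URI[] viaEmpty = fileURI.toArray(new URI[0]);

        // Pre-sized form: the elements are copied straight into the array
        // we supply, skipping the reflective allocation inside toArray.
        URI[] viaSized = fileURI.toArray(new URI[fileURI.size()]);

        System.out.println(Arrays.equals(viaEmpty, viaSized)); // prints true
    }
}
```

Both calls return an array holding the same URIs; the difference is purely where the array allocation happens.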