Denes Bodo created OOZIE-3227:
---------------------------------

             Summary: Eliminate duplicated dependencies from distributed cache
                 Key: OOZIE-3227
                 URL: https://issues.apache.org/jira/browse/OOZIE-3227
             Project: Oozie
          Issue Type: Sub-task
          Components: core
    Affects Versions: 5.0.0
            Reporter: Denes Bodo
            Assignee: Denes Bodo

With Hadoop 3, the list in *mapreduce.job.cache.files* must not contain multiple dependencies with the same file name. The issue occurs when the same file name is present in more than one sharelib folder and/or in my application's lib folder. This can be avoided, but not always easily.

I suggest removing the duplicates from this list. A quick workaround in JavaActionExecutor could look like this:
{code}
removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.files");
removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.archives");
......
private void removeDuplicatedDependencies(JobConf conf, String key) {
    final String dependencies = conf.get(key);
    if (dependencies == null || dependencies.isEmpty()) {
        return; // key not set, nothing to deduplicate
    }
    final Map<String, String> nameToPath = new HashMap<>();
    final StringBuilder uniqList = new StringBuilder();
    for (String dependency : dependencies.split(",")) {
        // The file name is the last path segment of the dependency.
        final String[] arr = dependency.split("/");
        final String dependencyName = arr[arr.length - 1];
        if (nameToPath.containsKey(dependencyName)) {
            LOG.warn(dependencyName + " [" + dependency + "] is already defined in " + key + ". Skipping...");
        } else {
            // First occurrence wins; later entries with the same name are skipped.
            nameToPath.put(dependencyName, dependency);
            uniqList.append(dependency).append(",");
        }
    }
    if (uniqList.length() > 0) {
        uniqList.setLength(uniqList.length() - 1); // drop the trailing comma
    }
    conf.set(key, uniqList.toString());
}
{code}

Another option is to stop using the deprecated *org.apache.hadoop.filecache.DistributedCache*.

I am going to look more closely at how we should use the distributed cache; all comments are welcome.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
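
For experimenting outside Oozie, the deduplication idea from the description can be sketched as a plain string-to-string method. This is only an illustrative sketch: the class name `CacheListDedup`, the method name `dedupeByFileName`, and the sample HDFS paths are hypothetical and are not part of Oozie or Hadoop. It assumes, as the workaround above does, that the file name is the last `/`-separated segment of each entry and that the first occurrence should win.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Oozie: deduplicates a comma-separated
// dependency list by file name, keeping the first path seen for each name.
class CacheListDedup {

    static String dedupeByFileName(String commaSeparatedPaths) {
        if (commaSeparatedPaths == null || commaSeparatedPaths.trim().isEmpty()) {
            return commaSeparatedPaths; // nothing to deduplicate
        }
        // LinkedHashMap preserves the original order of the surviving entries.
        Map<String, String> nameToPath = new LinkedHashMap<>();
        for (String entry : commaSeparatedPaths.split(",")) {
            String path = entry.trim();
            if (path.isEmpty()) {
                continue; // ignore empty entries such as "a.jar,,b.jar"
            }
            // File name = text after the last '/', or the whole entry if there is none.
            String name = path.substring(path.lastIndexOf('/') + 1);
            nameToPath.putIfAbsent(name, path);
        }
        return String.join(",", nameToPath.values());
    }

    public static void main(String[] args) {
        // Sample paths are made up for illustration.
        String files = "hdfs://nn/share/lib/oozie/a.jar,"
                + "hdfs://nn/user/app/lib/a.jar,"
                + "hdfs://nn/share/lib/pig/b.jar";
        System.out.println(dedupeByFileName(files));
    }
}
```

Keeping the logic free of `JobConf` makes it easy to unit test; a caller would read the property, pass the value through the method, and set the result back.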