[ https://issues.apache.org/jira/browse/OOZIE-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16449335#comment-16449335 ]
Andras Piros edited comment on OOZIE-3227 at 4/24/18 5:57 AM:
--------------------------------------------------------------

[~dionusos] to me it seems quite dangerous to remove any JARs that come with any sharelib or user-provided folders or resources. Using proper symlinking instead seems a better idea. To get a better understanding, can you please show a minimal reproduction case:
* {{workflow.xml}} and {{job.properties}}, to see which sharelibs are used
* the user-provided {{lib}} folder, as well as any JARs and / or folders provided
* which Hadoop version you build Oozie with, as well as which Hadoop version Oozie tries to connect to at runtime

Thanks!

was (Author: andras.piros):
[~dionusos] to me it seems quite dangerous to remove any JARs that come with any sharelib or user-provided folders or resources. To get a better understanding, can you please show a minimal reproduction case:
* {{workflow.xml}} and {{job.properties}}, to see which sharelibs are used
* the user-provided {{lib}} folder, as well as any JARs and / or folders provided
* which Hadoop version you build Oozie with, as well as which Hadoop version Oozie tries to connect to at runtime

Thanks!

> Eliminate duplicated dependencies from distributed cache
> --------------------------------------------------------
>
>                 Key: OOZIE-3227
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3227
>             Project: Oozie
>          Issue Type: Sub-task
>          Components: core
>    Affects Versions: 5.0.0
>            Reporter: Denes Bodo
>            Assignee: Denes Bodo
>            Priority: Major
>
> With Hadoop 3, it is not allowed to have multiple dependencies with the same
> file name on the list of *mapreduce.job.cache.files*.
> The issue occurs when the same file name appears in multiple sharelib folders
> and/or the application's lib folder. This can be avoided, but not easily in
> every case.
> I suggest removing the duplicates from this list.
> A quick workaround for the source code in JavaActionExecutor is like:
> {code}
> removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.files");
> removeDuplicatedDependencies(launcherJobConf, "mapreduce.job.cache.archives");
> ......
> private void removeDuplicatedDependencies(JobConf conf, String key) {
>     final Map<String, String> nameToPath = new HashMap<>();
>     final StringBuilder uniqList = new StringBuilder();
>     for (String dependency : conf.get(key).split(",")) {
>         final String[] arr = dependency.split("/");
>         final String dependencyName = arr[arr.length - 1];
>         if (nameToPath.containsKey(dependencyName)) {
>             LOG.warn(dependencyName + " [" + dependency + "] is already defined in " + key + ". Skipping...");
>         } else {
>             nameToPath.put(dependencyName, dependency);
>             uniqList.append(dependency).append(",");
>         }
>     }
>     // guard against an empty result before trimming the trailing comma
>     if (uniqList.length() > 0) {
>         uniqList.setLength(uniqList.length() - 1);
>     }
>     conf.set(key, uniqList.toString());
> }
> {code}
> Another way is to eliminate the deprecated
> *org.apache.hadoop.filecache.DistributedCache*.
> I am going to get a deeper understanding of how we should use the distributed
> cache, and all comments are welcome.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
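For illustration, the deduplication idea in the workaround above can be sketched as a self-contained method, without depending on {{JobConf}}. This is a hypothetical standalone version (the class name {{DedupSketch}} and the plain-string signature are assumptions for the demo, not Oozie code); like the quoted snippet, it keys on the bare file name and keeps the first path seen:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DedupSketch {

    /**
     * Removes entries from a comma-separated URI list whose file name
     * (the part after the last '/') has already been seen earlier in
     * the list. Mirrors the workaround's first-occurrence-wins policy.
     */
    static String removeDuplicatedDependencies(String list) {
        // LinkedHashMap preserves the order of first occurrence
        final Map<String, String> nameToPath = new LinkedHashMap<>();
        for (String dependency : list.split(",")) {
            final String[] arr = dependency.split("/");
            final String dependencyName = arr[arr.length - 1];
            // keep only the first path seen for a given file name
            nameToPath.putIfAbsent(dependencyName, dependency);
        }
        return String.join(",", nameToPath.values());
    }

    public static void main(String[] args) {
        // a.jar appears both in a sharelib folder and in the workflow's lib folder
        String cacheFiles = "hdfs:/share/lib/oozie/a.jar,"
                + "hdfs:/user/wf/lib/a.jar,"
                + "hdfs:/user/wf/lib/b.jar";
        System.out.println(removeDuplicatedDependencies(cacheFiles));
        // prints: hdfs:/share/lib/oozie/a.jar,hdfs:/user/wf/lib/b.jar
    }
}
```

Note that first-occurrence-wins silently decides which of two differently-versioned JARs ends up on the classpath, which is exactly the concern about dropping sharelib or user-provided JARs raised above.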