[ https://issues.apache.org/jira/browse/YARN-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251911#comment-15251911 ]
Daniel Templeton commented on YARN-4958:
----------------------------------------

This JIRA is in the same space as HADOOP-12747. It's solving the same problem in a completely different way, with different side-effects. I think there is room for and value in both JIRAs.

> The file localization process should allow for wildcards to reduce the
> application footprint in the state store
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4958
>                 URL: https://issues.apache.org/jira/browse/YARN-4958
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.8.0
>            Reporter: Daniel Templeton
>            Assignee: Daniel Templeton
>            Priority: Critical
>         Attachments: YARN-4958.001.patch
>
>
> When using the -libjars option to add classes to the classpath, every library
> so added is explicitly listed in the {{ContainerLaunchContext}}'s local
> resources even though they're all uploaded to the same directory in HDFS.
> When using tools like Crunch without an uber JAR or when trying to take
> advantage of the shared cache, the number of libraries can be quite large.
> We've seen many cases where we had to turn down the max number of
> applications to prevent ZK from running out of heap because of the size of
> the state store entries.
> Rather than listing all files independently, this JIRA proposes to have the
> NM allow wildcards in the resource localization paths. Specifically, we
> propose to allow a path to have a final component (name) set to "*", which is
> interpreted by the NM as "download the full directory and link to every file
> in it from the job's working directory." This behavior is the same as the
> current behavior when using -libjars, but avoids explicitly listing every
> file.
> This JIRA does not attempt to provide more general purpose wildcards, such as
> "*.jar" or "file*", as having multiple entries for a single directory
> presents numerous logistical issues.
> This JIRA also does not attempt to integrate with the shared cache. That
> work will be left to a future JIRA.
> This JIRA proposes to allow for wildcards both in the internal processing of
> the -libjars switch and in paths added through the {{Job}} and
> {{DistributedCache}} classes.
> The proposed approach is to treat a path, "dir/*", as "dir" for purposes of
> all file verification. In the final step, the NM will query the localized
> directory to get a list of the files in "dir" such that each can be linked
> from the job's working directory.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
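
For readers who want a concrete picture of the approach described in the issue (treat "dir/*" as "dir" for verification, then link every entry of the localized directory from the working directory), here is a minimal sketch in plain Java. It is not the code from YARN-4958.001.patch and does not use any Hadoop API. The class name WildcardLocalizationSketch and the methods hasWildcard, directoryOf, and linkAll are hypothetical names chosen for illustration, and java.nio local paths stand in for the NodeManager's localized directories.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the NodeManager code from the patch. */
class WildcardLocalizationSketch {
    static final String WILDCARD = "*";

    /** True only when the final path component is exactly "*" (not "*.jar" or "file*"). */
    static boolean hasWildcard(String path) {
        return path.equals(WILDCARD) || path.endsWith("/" + WILDCARD);
    }

    /** Maps "dir/*" to "dir", the path that would be verified and downloaded. */
    static String directoryOf(String path) {
        if (!hasWildcard(path)) {
            return path; // no wildcard: the path is used as-is
        }
        int slash = path.lastIndexOf('/');
        if (slash < 0) {
            return ".";  // a bare "*" refers to the current directory
        }
        return slash == 0 ? "/" : path.substring(0, slash);
    }

    /**
     * Final step: list the localized directory and create one symbolic link
     * per entry in the working directory. Returns the links that were created.
     */
    static List<Path> linkAll(Path localizedDir, Path workDir) throws IOException {
        List<Path> links = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(localizedDir)) {
            for (Path entry : entries) {
                Path link = workDir.resolve(entry.getFileName());
                Files.createSymbolicLink(link, entry.toAbsolutePath());
                links.add(link);
            }
        }
        return links;
    }
}
```

Under these assumptions, a resource path such as "hdfs:///libs/*" would be recorded as a single entry, checked as "hdfs:///libs", and expanded into per-file links only after the directory has been localized, which is what keeps the state store entry small.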