[ https://issues.apache.org/jira/browse/YARN-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16324545#comment-16324545 ]
Jason Lowe commented on YARN-2185: ---------------------------------- Thanks for updating the patch! bq. I was wondering, if it should be better for OS specific archives containing symlinks for example. symlinks is a good example of why it is important to use {{tar}} on Linux. However if we're going to use OS tools to support ZIP files properly then we should not be using the {{jar}} command. {{jar}} does not properly support symlinks in a ZIP archive but {{unzip}} does. To be honest I don't see the appeal of using the jar command for anything since I do not see how it is going to do anything different than the JDK builtin support since it is part of the JDK. As with the parallel copying and timestamp ignore support, I think it may be best to have this patch focus on using pipes rather than also changing the semantics of what tools are used to unpack the archives. We can discuss what should be done with jars and zips in another JIRA. Speaking of symbolic links, now that all supported releases are on JDK7 or later, we can update unTarUsingJava to fix the lacking symlink support. I filed HADOOP-15170 to track that. The failure on the path containing a single quote seems overly paranoid. Single quotes can be escaped in the POSIX shell code path per my previous comment, e.g.: {code} filename.replace("'","'\\''") {code} Rather than do the subshell with an extra {{cd}}, could it be simplified to just {{ tar -C }}? It doesn't look like the following comment was addressed as runCommandOnStream is still creating an executor sometimes yet does not clean it up in those cases. The Javadoc should also be updated to document the deadlock potential. {quote} I think we should shutdown the executor service if we created it, otherwise we risk running out of threads before the garbage collector gets around to noticing it can free everything up. Either that or require the caller to provide the executor. {quote} Speaking of deadlock, if debug logging is not enabled then this can deadlock in the manner I described above. That's one of the reasons why I think leveraging Shell isn't a terrible idea. Wielding ProcessBuilder and Process properly is trickier than most people believe. It also side-steps a lot of the executor management stuff. I'm OK if you still think it's better to roll our own here. "parallelVerifyAndCopy" is no longer parallel so the name and javadoc is misleading. It also does not throw InterruptedException nor ExecutionException. It's weird to allow the user to specify a downloader thread count yet we setup and teardown the executor for every unpack call and the unpack call is only going to use two threads. Essentially it shouldn't be a config since any value but 2 is either wrong or wasteful. Speaking of executors, would it makes more sense for FSDownload to be able to leverage an existing executor rather than creating its own? All of the existing usages of FSDownload are using an executor of one form or another, so it'd be nice not to have to keep building and tearing these down for every call. I'm still confuse about the recursive destination directory delete that was added. It was not there before, and I don't think it's appropriate here. What are we worried is going to be there, and if somehow something is there, are we sure it's OK to just blindly remove it? destionationTmp s/b destinationTmp FSDownloadException is no longer needed. > Use pipes when localizing archives > ---------------------------------- > > Key: YARN-2185 > URL: https://issues.apache.org/jira/browse/YARN-2185 > Project: Hadoop YARN > Issue Type: Improvement > Components: nodemanager > Affects Versions: 2.4.0 > Reporter: Jason Lowe > Assignee: Miklos Szegedi > Attachments: YARN-2185.000.patch, YARN-2185.001.patch, > YARN-2185.002.patch, YARN-2185.003.patch, YARN-2185.004.patch > > > Currently the nodemanager downloads an archive to a local file, unpacks it, > and then removes it. It would be more efficient to stream the data as it's > being unpacked to avoid both the extra disk space requirements and the > additional disk activity from storing the archive. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org