[ https://issues.apache.org/jira/browse/YARN-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16316777#comment-16316777 ]

Jason Lowe commented on YARN-2185:
----------------------------------

Thanks for the patch!

This patch adds parallel copying of directories and the ability to ignore 
timestamps, which IMHO is outside the scope of this patch.  I think those 
should be handled in separate JIRAs with appropriate unit tests.

Do we really want a while loop to pull off extensions?  That's not how it 
behaved before, and it will now do different, potentially unexpected, things 
for a ".jar.gz", ".tgz.tgz", ".gz.jar", etc.  Granted, these are unlikely 
filenames to encounter, but if a user somehow had such filenames working 
before, this will change that behavior.
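To make the concern concrete, a loop that keeps stripping every recognized suffix behaves like this sketch (the extension list and method are illustrative, not the patch's actual code):

```java
public class ExtStrip {
  private static final String[] EXTS = {".jar", ".zip", ".tar", ".tgz", ".gz"};

  /** Strip known archive extensions repeatedly, as a while loop would. */
  public static String stripAll(String name) {
    boolean stripped = true;
    while (stripped) {
      stripped = false;
      for (String ext : EXTS) {
        if (name.endsWith(ext)) {
          name = name.substring(0, name.length() - ext.length());
          stripped = true;
        }
      }
    }
    return name;
  }
}
```

With this looping behavior "app.tgz.tgz" reduces all the way to "app" and "x.gz.jar" to "x", whereas a single-pass strip would remove only the last suffix.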

Does it make sense to use an external process to unpack .jar and .zip files?  
The original code did this via the JDK directly rather than requiring a 
subprocess, and given there's a ZipInputStream and JarInputStream, it seems we 
could do the same here with the input stream rather than requiring the overhead 
of a subprocess and copying of data between streams.
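A minimal sketch of what stream-based unpacking could look like with {{ZipInputStream}} (class and method names here are illustrative, not from the patch; the zip-slip guard is included for safety):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class StreamUnzip {
  /** Unpack a zip/jar directly from an input stream, no subprocess needed. */
  public static void unpack(InputStream in, File destDir) throws IOException {
    try (ZipInputStream zip = new ZipInputStream(in)) {
      ZipEntry entry;
      byte[] buf = new byte[8192];
      while ((entry = zip.getNextEntry()) != null) {
        File out = new File(destDir, entry.getName());
        // Reject entries that would escape the destination directory.
        if (!out.getCanonicalPath().startsWith(destDir.getCanonicalPath())) {
          throw new IOException("Blocked zip entry outside dest: " + entry.getName());
        }
        if (entry.isDirectory()) {
          out.mkdirs();
          continue;
        }
        out.getParentFile().mkdirs();
        try (FileOutputStream fos = new FileOutputStream(out)) {
          int n;
          while ((n = zip.read(buf)) > 0) {
            fos.write(buf, 0, n);
          }
        }
      }
    }
  }
}
```

A {{JarInputStream}} could be dropped in the same way when the manifest matters.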

This now logs an INFO message for all stderr and stdout for the subprocess 
(i.e.: at least every entry in the archive), which is not something it did 
before.  This should be debug, if logged at all, and it requires buffering all 
of the output.  Some of our users have some crazy use cases with very 
large/deep tarballs, and grabbing all the output from tar would not be a 
trivial amount of memory for the NM or a localizer heap, especially if there 
are multiple of these at the same time.  If the log message is kept, the 
verbose flag should be passed to the command only if the output will be logged.

I think it would be better to explicitly create the directory from the Java 
code rather than squirrel it into the tar command.  We can provide a better 
error message with more context if done directly, and as it is now the mkdir 
can fail yet the shell will blindly proceed to the tar command.
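For example, something along these lines (method and message are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DestDir {
  /** Create the destination directory up front so a failure carries context. */
  public static Path ensureDest(String destPath) throws IOException {
    Path dest = Paths.get(destPath);
    try {
      Files.createDirectories(dest);
    } catch (IOException e) {
      throw new IOException("Failed to create localization directory "
          + dest.toAbsolutePath(), e);
    }
    return dest;
  }
}
```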

Wouldn't "cd _destpath_ && tar -xf" be better than "cd _destpath_; tar -xf"?  
Otherwise if somehow the change directory fails it will end up unpacking to an 
unexpected place.

Is the "rm -rf" necessary?

Has this been tested on Windows?  The zipOrJar path appears to be eligible for 
Windows but uses shell syntax that {{cmd}} is not going to understand, like 
semicolons between commands.

Does the single-quote escaping work properly?  Escaping double quotes with a 
backslash works within a double-quoted string, but the same is not true for 
single-quoted strings.  The shell does not interpret any characters in a 
single-quoted string, even backslash characters.  Therefore the only way I know 
of to put a single-quote character in a single-quoted string is to terminate 
the current string, insert an escaped single quote (since we're not in a 
single-quote parsing context at this point), then start the single-quote string 
up again.  In other words, each {{'}} character trying to be escaped within a 
single-quote context becomes the four character sequence: {{'\''}}  For 
example, single-quoting {{abc'def'ghi}} becomes {{'abc'\''def'\''ghi'}}.
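That escaping rule is easy to get wrong, so it may be worth capturing in a small helper like this sketch (illustrative, not the patch's code):

```java
public class ShellQuote {
  /**
   * Wrap s in single quotes for a POSIX shell.  Each embedded single quote
   * becomes the four-character sequence '\'' : close the quoted string,
   * emit an escaped quote, then reopen the quoted string.
   */
  public static String singleQuote(String s) {
    return "'" + s.replace("'", "'\\''") + "'";
  }
}
```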

I think we should shut down the executor service if we created it, otherwise we 
risk running out of threads before the garbage collector gets around to 
noticing it can free everything up.  Either that or require the caller to 
provide the executor.  Note that if the caller provides a single thread 
executor (or some other executor that can only execute serially) then this can 
deadlock if the process emits a lot of output on stderr.  The thread trying to 
consume stdout data is blocked until the child process completes, but the child 
process is blocked trying to emit stderr output that the parent process is not 
consuming.  I wonder if it would be better to add a method to Shell that takes 
an input stream so we could reuse all the subprocess handling code there.
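For the deadlock concern, one common pattern is to drain stderr on its own dedicated thread rather than through the shared executor, so the child can never block on a full stderr pipe while the parent waits on stdout (sketch only, not the patch's code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ProcessDrain {
  /** Consume stderr on a dedicated thread so the child never stalls on a full pipe. */
  public static Thread drainStderr(final Process p) {
    Thread t = new Thread(() -> {
      try (BufferedReader r = new BufferedReader(
          new InputStreamReader(p.getErrorStream()))) {
        while (r.readLine() != null) {
          // discard (or log at debug level)
        }
      } catch (IOException ignored) {
        // stream closed when the process exits
      }
    });
    t.setDaemon(true);
    t.start();
    return t;
  }
}
```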

This debug log was upgraded to an info, yet it still has the debug check in 
front of it.  I'm not sure this should be upgraded to an info log as part of 
this change.
{code}
    if (LOG.isDebugEnabled()) {
      LOG.info(String.format("Starting to download %s %s %s",
          sCopy,
          resource.getType(),
          resource.getPattern()));
    }
{code}

I don't see how the following code treats PATTERN files that don't end in 
'.jar' like ARCHIVE files since the resource type check will still find PATTERN 
instead of ARCHIVE when it falls through from the first {{if}} to the second:
{code}
      if (resource.getType() == LocalResourceType.PATTERN) {
        if (destinationFile.endsWith(".jar")) {
          // Unpack and keep a copy of the whole jar for mapreduce
          String p = resource.getPattern();
          RunJar.unJarAndSave(inputStream, new File(destination.toUri()),
              source.getName(),
              p == null ? RunJar.MATCH_ANY : Pattern.compile(p));
          return;
        } else {
          LOG.warn("Treating [" + source + "] " +
              "as an archive even though it " +
              "was specified as PATTERN");
        }
      }
      if (resource.getType() == LocalResourceType.ARCHIVE) {
        if (FileUtil.decompress(
            inputStream, source.getName(), destination, executor)) {
          return;
        } else {
          LOG.warn("Cannot unpack [" + source + "] trying to copy");
        }
      }
{code}

Is there a reason to implement the copy loop directly in the new unJar method 
rather than calling copyBytes as the other one does?


> Use pipes when localizing archives
> ----------------------------------
>
>                 Key: YARN-2185
>                 URL: https://issues.apache.org/jira/browse/YARN-2185
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.4.0
>            Reporter: Jason Lowe
>            Assignee: Miklos Szegedi
>         Attachments: YARN-2185.000.patch, YARN-2185.001.patch, 
> YARN-2185.002.patch
>
>
> Currently the nodemanager downloads an archive to a local file, unpacks it, 
> and then removes it.  It would be more efficient to stream the data as it's 
> being unpacked to avoid both the extra disk space requirements and the 
> additional disk activity from storing the archive.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
