[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15981833#comment-15981833 ]

Sahil Takiar commented on HIVE-14864:
-------------------------------------

I updated the documentation for this here: https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryandDDLExecution

> Distcp is not called from MoveTask when src is a directory
> ----------------------------------------------------------
>
> Key: HIVE-14864
> URL: https://issues.apache.org/jira/browse/HIVE-14864
> Project: Hive
> Issue Type: Bug
> Reporter: Vihang Karajgaonkar
> Assignee: Sahil Takiar
> Labels: TODOC2.2
> Fix For: 2.2.0
>
> Attachments: HIVE-14864.1.patch, HIVE-14864.2.patch, HIVE-14864.3.patch, HIVE-14864.4.patch, HIVE-14864.patch
>
>
> In FileUtils.java the following code does not get executed even when src directory size is greater than HIVE_EXEC_COPYFILE_MAXSIZE because srcFS.getFileStatus(src).getLen() returns 0 when src is a directory. We should use srcFS.getContentSummary(src).getLength() instead.
> {noformat}
>     /* Run distcp if source file/dir is too big */
>     if (srcFS.getUri().getScheme().equals("hdfs") &&
>         srcFS.getFileStatus(src).getLen() > conf.getLongVar(HiveConf.ConfVars.HIVE_EXEC_COPYFILE_MAXSIZE)) {
>       LOG.info("Source is " + srcFS.getFileStatus(src).getLen() + " bytes. (MAX: " + conf.getLongVar(HiveConf.ConfVars.HIVE_EXEC_COPYFILE_MAXSIZE) + ")");
>       LOG.info("Launch distributed copy (distcp) job.");
>       HiveConfUtil.updateJobCredentialProviders(conf);
>       copied = shims.runDistCp(src, dst, conf);
>       if (copied && deleteSource) {
>         srcFS.delete(src, true);
>       }
>     }
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
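The issue description quoted above can be illustrated with a minimal sketch. It uses a hypothetical in-memory tree, not the Hadoop FileSystem API: `Entry`, `statusLength`, `contentLength` and `CopyDecision` are names invented for this illustration, and the 32 MB limit is only an example value. It assumes, as the description states, that a directory's own status length is 0 and that the recursive sum of file lengths is what should be compared against the limit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory stand-in for a file system entry; not a Hadoop class. */
final class Entry {
    final String name;
    final long length;          // bytes for a file, unused for a directory
    final List<Entry> children; // null for a file

    private Entry(String name, long length, List<Entry> children) {
        this.name = name;
        this.length = length;
        this.children = children;
    }

    static Entry file(String name, long length) {
        return new Entry(name, length, null);
    }

    static Entry dir(String name, Entry... kids) {
        return new Entry(name, 0L, new ArrayList<Entry>(Arrays.asList(kids)));
    }

    boolean isDirectory() {
        return children != null;
    }

    /** Models the behaviour reported in the issue: a directory reports length 0. */
    long statusLength() {
        return isDirectory() ? 0L : length;
    }

    /** Models the proposed replacement: sum the lengths of all files below this entry. */
    long contentLength() {
        if (!isDirectory()) {
            return length;
        }
        long total = 0L;
        for (Entry child : children) {
            total += child.contentLength();
        }
        return total;
    }
}

final class CopyDecision {
    /** True when the measured source size is above the configured limit. */
    static boolean exceeds(long sourceBytes, long maxSizeBytes) {
        return sourceBytes > maxSizeBytes;
    }

    public static void main(String[] args) {
        Entry src = Entry.dir("part", Entry.file("a", 20L << 20), Entry.file("b", 30L << 20));
        long maxSize = 32L << 20; // example limit only
        // With the directory's own status length the check never fires.
        System.out.println("status length check:  " + exceeds(src.statusLength(), maxSize));
        // With the recursive sum the same directory is over the limit.
        System.out.println("content length check: " + exceeds(src.contentLength(), maxSize));
    }
}
```

In this model a 50 MB directory is never considered "too big" when only its own status length is checked, which matches the behaviour reported for the quoted code.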
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906158#comment-15906158 ]

Lefty Leverenz commented on HIVE-14864:
---------------------------------------

Doc note: This adds *hive.exec.copyfile.maxnumfiles* to HiveConf.java and corrects the description of *hive.exec.copyfile.maxsize* (added in 1.1.0 by HIVE-8750 but not documented yet) so they need to be documented in the wiki.

* [Configuration Properties -- Query and DDL Execution | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryandDDLExecution]

Added a TODOC2.2 label.
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15903491#comment-15903491 ]

Sergio Peña commented on HIVE-14864:
------------------------------------

The patch looks good [~stakiar]
+1
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902953#comment-15902953 ]

Hive QA commented on HIVE-14864:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12856936/HIVE-14864.4.patch

{color:green}SUCCESS:{color} +1 due to 1 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 10337 tests executed
*Failed tests:*
{noformat}
org.apache.hive.jdbc.TestMultiSessionsHS2WithLocalClusterSpark.testSparkQuery (batchId=219)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/4047/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/4047/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-4047/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12856936 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15902460#comment-15902460 ]

Sahil Takiar commented on HIVE-14864:
-------------------------------------

[~spena] I worked on this some more, and think a unit test may be better suited for this patch rather than a qtest. There are a number of different queries that could invoke this method (e.g. IMPORT queries use this method too), and more may be added in the future. I added some integration tests that run against a mini HDFS cluster, and some unit tests that just rely on mocking.

[~ste...@apache.org] I agree, calling {{getContentSummary}} on S3 will be very slow. I've thought about filing a JIRA for the optimization you mentioned a few times, but never got around to doing it. Fortunately, this specific code won't be hit for S3, only for HDFS.
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15895818#comment-15895818 ]

Steve Loughran commented on HIVE-14864:
---------------------------------------

{{FileSystem.getContentSummary()}} does a recursive treewalk, so it is pathologically bad on a blobstore, which has to mock directories through many, many HTTP requests.

If you need to use it, could you actually supply a patch (+ FS contract tests) for the method so that it uses listFiles(path, recursive=true)? That does the same treewalk against HDFS, but blobstores can do it as an O(1) listing call instead. If you can get that patch in, then enumerating the size of a blobstore tree will be fast.
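The difference between the two listing strategies in the comment above can be sketched with a toy model. This is not Hadoop or S3 code: `BlobStoreModel` is a hypothetical flat key/size map, directories exist only as key prefixes, and each method call that would need a round trip to the store is counted as one simulated request. Real blobstores page their listings and may issue extra requests, so the counts are illustrative only.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical flat key/size store standing in for a blobstore; not a Hadoop API. */
final class BlobStoreModel {
    private final TreeMap<String, Long> objects = new TreeMap<String, Long>();
    int listRequests = 0; // simulated round trips to the store

    void put(String key, long size) {
        objects.put(key, size);
    }

    /** All keys that start with the given prefix, at any depth. */
    private SortedMap<String, Long> scan(String prefix) {
        return objects.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    /** Recursive treewalk: one simulated request per "directory" visited. */
    long sizeByTreewalk(String dirPrefix) {
        listRequests++;
        long total = 0L;
        Set<String> subdirs = new TreeSet<String>();
        for (Map.Entry<String, Long> e : scan(dirPrefix).entrySet()) {
            String rest = e.getKey().substring(dirPrefix.length());
            int slash = rest.indexOf('/');
            if (slash < 0) {
                total += e.getValue(); // direct child file
            } else {
                subdirs.add(dirPrefix + rest.substring(0, slash + 1)); // mock directory
            }
        }
        for (String sub : subdirs) {
            total += sizeByTreewalk(sub);
        }
        return total;
    }

    /** Flat recursive listing: a single simulated request for the whole subtree. */
    long sizeByFlatListing(String dirPrefix) {
        listRequests++;
        long total = 0L;
        for (long size : scan(dirPrefix).values()) {
            total += size;
        }
        return total;
    }
}
```

Both methods return the same total; in this model the treewalk's request count grows with the number of directories under the prefix, while the flat listing stays at one.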
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892913#comment-15892913 ]

Sergio Peña commented on HIVE-14864:
------------------------------------

Doesn't the mini HDFS accept hdfs? Check all the encryption tests, such as encryption_ctas.q. They usually check for the hdfs scheme.
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15891510#comment-15891510 ]

Sahil Takiar commented on HIVE-14864:
-------------------------------------

[~spena] I am trying to write some qtests for this. Do you know which CliDrivers run against HDFS filesystems? When I run the {{TestCliDriver}}, all the URIs have schemes of either {{file}} or {{pfile}}. This code path only gets triggered if the scheme is {{hdfs}}.
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1550#comment-1550 ]

Sergio Peña commented on HIVE-14864:
------------------------------------

Can we add q-tests for this new variable?
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883969#comment-15883969 ]

Hive QA commented on HIVE-14864:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12854547/HIVE-14864.3.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 10259 tests executed
*Failed tests:*
{noformat}
TestDerbyConnector - did not produce a TEST-*.xml file (likely timed out) (batchId=235)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[vector_if_expr] (batchId=140)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query14] (batchId=223)
org.apache.hadoop.hive.cli.TestPerfCliDriver.testCliDriver[query23] (batchId=223)
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/3768/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/3768/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-3768/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12854547 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582459#comment-15582459 ]

Sergio Peña commented on HIVE-14864:
------------------------------------

A couple of comments:
* Will srcFS.getContentSummary(src) take extra time when the source is on S3? If so, maybe we want to put this line inside the if() statement.
* While we're here, could you fix the message from HIVE_EXEC_COPYFILE_MAXSIZE to say (in Bytes) instead of (in Mb)?
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15577556#comment-15577556 ]

Hive QA commented on HIVE-14864:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12833472/HIVE-14864.2.patch

{color:red}ERROR:{color} -1 due to no test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 8 failed/errored test(s), 10564 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_globallimit]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[order_null]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[union_fast_stats]
org.apache.hadoop.hive.cli.TestHBaseCliDriver.testCliDriver[hbase_bulk]
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJarWithoutAddDriverClazz[0]
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[0]
org.apache.hive.beeline.TestBeelineArgParsing.testAddLocalJar[1]
org.apache.hive.jdbc.authorization.TestJdbcWithSQLAuthorization.testBlackListedUdfUsage
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1581/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1581/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-1581/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 8 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12833472 - PreCommit-HIVE-Build
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573955#comment-15573955 ] Hive QA commented on HIVE-14864: Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12833234/HIVE-14864.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 89 failed/errored test(s), 10534 tests executed *Failed tests:* {noformat} TestMiniLlapLocalCliDriver - did not produce a TEST-*.xml file org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[autoColumnStats_7] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[crtseltbl_serdeprops] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cte_3] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cte_4] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cte_mat_4] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[cte_mat_5] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_00_nonpart_empty] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_01_nonpart] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_02_00_part_empty] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_02_part] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_03_nonpart_over_compat] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_all_part] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_04_evolved_parts] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_05_some_part] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_06_one_part] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_07_all_part_over_nonoverlap] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_08_nonpart_rename] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_09_part_spec_nonoverlap] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_10_external_managed] 
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_11_managed_external] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_12_external_location] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_13_managed_location] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_14_managed_location_over_existing] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_15_external_part] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_16_part_external] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_17_part_managed] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_18_part_external] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_19_00_part_external_location] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_19_part_external_location] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_20_part_managed_location] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_21_export_authsuccess] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_22_import_exist_authsuccess] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_23_import_part_authsuccess] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_24_import_nonexist_authsuccess] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_25_export_parentpath_has_inaccessible_children] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[exim_hidden_files] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[foldts] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[masking_9] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[materialized_view_describe] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[reloadJar] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[repl_2_exim_basic] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[repl_3_exim_metadata] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[stats_partial_size] 
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[temp_table] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[temp_table_gb1] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[temp_table_join1] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[temp_table_subquery1] org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_tablesample_rows] org.apache.hadoop.hive.cli.TestEncryptedHDFSCliDriver.testCliDriver[encryption_load_data_to_encrypted_tables] org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver[import_exported_table] org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver[import_exported_table] org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[authorization_import] org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[ctas_noemptyfolder] org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_01_nonpart_over_loaded] org.apache.hadoop.hive.cli.TestNegativeCliDriver.testCliDriver[exim_02_all_part_over_overlap]
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536989#comment-15536989 ]

Sergio Peña commented on HIVE-14864:
------------------------------------

I ran some tests with directories, and distcp is indeed faster than copying files one at a time. I don't know the threshold; we should test with different files to find it.

I think this should be included in C5.10 even if we don't use it for S3. Encryption directories on HDFS will benefit from this too.
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536925#comment-15536925 ]

Vihang Karajgaonkar commented on HIVE-14864:
--------------------------------------------

Adding [~spena] since he has worked on this before.
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536922#comment-15536922 ]

Vihang Karajgaonkar commented on HIVE-14864:

I looked into the distcp documentation and I think you are right. If src is a directory, it creates a list of the files to be copied, and this list is divided among a number of CopyMappers that do the actual copy. Do you have any ideas on how we should determine the number of files above which distcp becomes beneficial?
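To make the point about dividing the file list among mappers concrete, here is a minimal sketch of splitting a copy listing across workers. It is a simplified illustration, not DistCp's implementation: the class name `CopyListingSplitter`, the greedy least-loaded assignment, and the use of plain file sizes are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of dividing a copy listing among workers; not DistCp code. */
final class CopyListingSplitter {

    /**
     * Assigns each file to the worker that currently has the fewest bytes to copy.
     * Returns, for each worker, the indexes of the files it would copy.
     */
    static List<List<Integer>> split(long[] fileSizes, int workers) {
        if (workers <= 0) {
            throw new IllegalArgumentException("workers must be positive");
        }
        List<List<Integer>> plan = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            plan.add(new ArrayList<>());
        }
        long[] load = new long[workers]; // bytes assigned to each worker so far
        for (int i = 0; i < fileSizes.length; i++) {
            int target = 0;
            for (int w = 1; w < workers; w++) {
                if (load[w] < load[target]) {
                    target = w; // ties keep the lowest-numbered worker
                }
            }
            plan.get(target).add(i);
            load[target] += fileSizes[i];
        }
        return plan;
    }
}
```

With a whole file as the unit of work, a listing that contains a single file always ends up on one worker, however many workers are available, so parallelism only helps once the listing holds several files.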
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536870#comment-15536870 ]

Vihang Karajgaonkar commented on HIVE-14864:

Unfortunately, the documentation of ContentSummary.getLength() says ... returns the length :)

This is the implementation of getContentSummary() in FileSystem.java, which suggests that getLength() actually returns the sum of the lengths of all the files within that directory.
{noformat}
public ContentSummary getContentSummary(Path f) throws IOException {
  FileStatus status = getFileStatus(f);
  if (status.isFile()) {
    // f is a file
    return new ContentSummary(status.getLen(), 1, 0);
  }
  // f is a directory
  long[] summary = {0, 0, 1};
  for(FileStatus s : listStatus(f)) {
    ContentSummary c = s.isDirectory() ? getContentSummary(s.getPath()) :
                                         new ContentSummary(s.getLen(), 1, 0);
    summary[0] += c.getLength();
    summary[1] += c.getFileCount();
    summary[2] += c.getDirectoryCount();
  }
  return new ContentSummary(summary[0], summary[1], summary[2]);
}
{noformat}

These are the relevant constructors for ContentSummary.
{noformat}
/** Constructor */
public ContentSummary(long length, long fileCount, long directoryCount) {
  this(length, fileCount, directoryCount, -1L, length, -1L);
}

public ContentSummary(
    long length, long fileCount, long directoryCount, long quota,
    long spaceConsumed, long spaceQuota) {
  this.length = length;
  this.fileCount = fileCount;
  this.directoryCount = directoryCount;
  this.quota = quota;
  this.spaceConsumed = spaceConsumed;
  this.spaceQuota = spaceQuota;
}
{noformat}
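The recursion in the getContentSummary() implementation quoted above can be tried without a Hadoop cluster. The following is a rough local-filesystem analogue built on java.nio.file; the class name `LocalContentSummary` and its fields are hypothetical and only mirror the three values that the quoted code accumulates (total length, file count, directory count).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical local-filesystem analogue of the summary computed by getContentSummary(). */
final class LocalContentSummary {
    final long length;          // sum of the sizes of all files under the path
    final long fileCount;       // number of files under the path
    final long directoryCount;  // number of directories, including the path itself

    LocalContentSummary(long length, long fileCount, long directoryCount) {
        this.length = length;
        this.fileCount = fileCount;
        this.directoryCount = directoryCount;
    }

    /** Walks the tree recursively; symbolic links are followed, so link cycles are not handled. */
    static LocalContentSummary of(Path p) throws IOException {
        if (!Files.isDirectory(p)) {
            // Anything that is not a directory is counted as one file.
            return new LocalContentSummary(Files.size(p), 1, 0);
        }
        long length = 0;
        long files = 0;
        long dirs = 1; // count the directory itself
        try (DirectoryStream<Path> children = Files.newDirectoryStream(p)) {
            for (Path child : children) {
                LocalContentSummary c = of(child);
                length += c.length;
                files += c.fileCount;
                dirs += c.directoryCount;
            }
        }
        return new LocalContentSummary(length, files, dirs);
    }
}
```

Calling `LocalContentSummary.of(dir)` on a directory returns the combined size of everything beneath it, which is the quantity a size threshold on a directory would need to be compared against.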
[jira] [Commented] (HIVE-14864) Distcp is not called from MoveTask when src is a directory
[ https://issues.apache.org/jira/browse/HIVE-14864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15536842#comment-15536842 ]

Mohit Sabharwal commented on HIVE-14864:

Does srcFS.getContentSummary(src).getLength() return the total number of files in the directory?

I think this whole condition needs to be rethought because, AFAIK, DistCp speeds up copies when multiple files are involved. There is no advantage in DistCp'ing a single file, no matter how big that file is, which means HIVE_EXEC_COPYFILE_MAXSIZE does not make sense even for a file.

IOW, we need to look into DistCp'ing directories only, and possibly only those that contain more than a threshold number of files.
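The condition proposed in the comment above (DistCp for directories only, and only past a file-count threshold) could look roughly like the sketch below. This is not the logic in Hive's FileUtils: the class `CopyStrategyChooser`, the `Strategy` enum, and the `minFilesForDistCp` parameter are hypothetical names, and a suitable threshold value would still have to be found by measurement.

```java
/** Hypothetical sketch of the proposed rule; not Hive's FileUtils logic. */
final class CopyStrategyChooser {

    enum Strategy { SERIAL_COPY, DISTCP }

    /**
     * @param isDirectory       whether the source path is a directory
     * @param fileCount         number of files under the source path
     * @param minFilesForDistCp hypothetical threshold above which DistCp is assumed to pay off
     */
    static Strategy choose(boolean isDirectory, long fileCount, long minFilesForDistCp) {
        if (!isDirectory) {
            // Per the proposal, a single file is never handed to DistCp, whatever its size.
            return Strategy.SERIAL_COPY;
        }
        return fileCount > minFilesForDistCp ? Strategy.DISTCP : Strategy.SERIAL_COPY;
    }
}
```

For example, `CopyStrategyChooser.choose(true, 500, 100)` selects DISTCP, while a plain file or a directory at or below the threshold falls back to a serial copy.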