[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278836#comment-14278836 ]

Hive QA commented on HIVE-9367:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12692326/HIVE-9367.2.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 7311 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_stats_counter
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union13
org.apache.hadoop.hive.ql.TestMTQueries.testMTQueries1
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2373/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2373/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2373/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12692326 - PreCommit-HIVE-TRUNK-Build

CombineFileInputFormatShim#getDirIndices is expensive
---
Key: HIVE-9367
URL: https://issues.apache.org/jira/browse/HIVE-9367
Project: Hive
Issue Type: Improvement
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
Attachments: HIVE-9367.1.patch, HIVE-9367.2.patch

[~lirui] found out that we spent quite some time on CombineFileInputFormatShim#getDirIndices. Looked into it and it seems to me we should be able to get rid of this method completely if we can enhance CombineFileInputFormatShim a little.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279798#comment-14279798 ]

Xuefu Zhang commented on HIVE-9367:
---

[~jxiang], are the failures related to your patch?
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279803#comment-14279803 ]

Jimmy Xiang commented on HIVE-9367:
---

Looked into them, not related.
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277679#comment-14277679 ]

Jimmy Xiang commented on HIVE-9367:
---

Sure, will remove it in next patch.
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277642#comment-14277642 ]

Xuefu Zhang commented on HIVE-9367:
---

Thanks for the explanation. This is a shim class, so we are okay. The patch looks good to me. One note, though: the prune() method no longer seems to be needed. Could you remove it?
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277931#comment-14277931 ]

Xuefu Zhang commented on HIVE-9367:
---

+1 pending on test
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277070#comment-14277070 ]

Xuefu Zhang commented on HIVE-9367:
---

Nice improvement. However, I'm a little concerned about overriding the listStatus() method, as any caller (including subclasses) would suddenly get a list with folders excluded. I'm wondering if it's possible to achieve the same optimization without overriding that method.
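The concern about overriding listStatus() can be illustrated with a minimal Java sketch. It does not use the real Hadoop or Hive classes: `Entry`, `BaseFormat`, `ShimFormat`, and `countInputs` are hypothetical stand-ins, and the fixed three-entry listing is made up for illustration. It only shows that once an override filters out directories, every caller that goes through listStatus() sees the filtered list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListStatusOverrideSketch {
    /** Hypothetical stand-in for a file status entry: a path plus a directory flag. */
    static final class Entry {
        final String path;
        final boolean dir;
        Entry(String path, boolean dir) { this.path = path; this.dir = dir; }
    }

    /** Hypothetical base input format: its listing contains files and directories alike. */
    static class BaseFormat {
        protected List<Entry> listStatus() {
            return Arrays.asList(
                new Entry("/t/part-0", false),
                new Entry("/t/sub", true),
                new Entry("/t/part-1", false));
        }

        /** Any caller of listStatus() sees whatever the most specific override returns. */
        int countInputs() {
            return listStatus().size();
        }
    }

    /** Hypothetical shim: the override drops directories from the listing. */
    static class ShimFormat extends BaseFormat {
        @Override
        protected List<Entry> listStatus() {
            List<Entry> files = new ArrayList<>();
            for (Entry e : super.listStatus()) {
                if (!e.dir) {
                    files.add(e);
                }
            }
            return files;
        }
    }

    public static void main(String[] args) {
        System.out.println("base sees: " + new BaseFormat().countInputs());
        System.out.println("shim sees: " + new ShimFormat().countInputs());
    }
}
```

In this sketch the inherited `countInputs()` changes its result without being modified itself, which is the kind of surprise the comment is worried about.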
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277208#comment-14277208 ]

Jimmy Xiang commented on HIVE-9367:
---

So far, I haven't found such a subclass or caller. Without overriding that method, we would need to enhance the MR code a little, for example by adding a new API or setting, which is not practical. For now, overriding the method is probably what we can do. Thanks.
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276414#comment-14276414 ]

Rui Li commented on HIVE-9367:
---

Hi [~jxiang], could you elaborate a little on how this will avoid the expensive calls? It seems we still have to iterate over all the file statuses to check whether each one is a directory?
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276435#comment-14276435 ]

Jimmy Xiang commented on HIVE-9367:
---

With the FileStatus, we don't need to go to the NN to get the FileStatus again, since FileStatus already has the info about whether the path is a file or a dir. Originally, in getDirIndices, we fetched the FileStatus again, which is an extra call for each file. So this patch saves us one call to get the FileStatus for each file.
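To make the saving concrete, here is a minimal Java sketch. It does not use the real Hadoop classes or the actual patch: `Status`, `FakeNameNode`, `dirIndicesByLookup`, and `dirIndicesFromListing` are hypothetical stand-ins, and a simple counter takes the place of a real NameNode call. It only illustrates that reusing the status objects returned by the listing avoids one lookup per path.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DirIndexSketch {
    /** Hypothetical stand-in for a file status: a path plus a directory flag. */
    static final class Status {
        final String path;
        final boolean dir;
        Status(String path, boolean dir) { this.path = path; this.dir = dir; }
    }

    /** Hypothetical stand-in for the NameNode; counts every simulated remote call. */
    static final class FakeNameNode {
        private final Map<String, Status> entries = new LinkedHashMap<>();
        int calls = 0;

        void add(String path, boolean dir) {
            entries.put(path, new Status(path, dir));
        }

        /** A single call returns the status of every entry. */
        List<Status> listStatus() {
            calls++;
            return new ArrayList<>(entries.values());
        }

        /** One call per path. */
        Status getFileStatus(String path) {
            calls++;
            return entries.get(path);
        }
    }

    /** Old approach (assumed shape): look each path up again to learn whether it is a directory. */
    static List<Integer> dirIndicesByLookup(FakeNameNode nn, List<Status> listing) {
        List<Integer> dirs = new ArrayList<>();
        for (int i = 0; i < listing.size(); i++) {
            if (nn.getFileStatus(listing.get(i).path).dir) {
                dirs.add(i);
            }
        }
        return dirs;
    }

    /** New approach (assumed shape): reuse the flag already present in the listing. */
    static List<Integer> dirIndicesFromListing(List<Status> listing) {
        List<Integer> dirs = new ArrayList<>();
        for (int i = 0; i < listing.size(); i++) {
            if (listing.get(i).dir) {
                dirs.add(i);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        FakeNameNode nn = new FakeNameNode();
        nn.add("/t/part-0", false);
        nn.add("/t/sub", true);
        nn.add("/t/part-1", false);

        List<Status> listing = nn.listStatus();

        int before = nn.calls;
        dirIndicesByLookup(nn, listing);
        System.out.println("extra calls, lookup per path: " + (nn.calls - before));

        before = nn.calls;
        dirIndicesFromListing(listing);
        System.out.println("extra calls, reusing listing: " + (nn.calls - before));
    }
}
```

Both methods still iterate over every entry, which matches the point raised in the question; the difference in this sketch is only the number of simulated remote calls.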
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276444#comment-14276444 ]

Rui Li commented on HIVE-9367:
---

I see. Thanks [~jxiang] for the explanation!
[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive
[ https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276527#comment-14276527 ]

Rui Li commented on HIVE-9367:
---

I just verified the patch here can reduce the getSplits time from 1s to less than 200ms. The test table consists of one 100GB sequence file.