[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-15 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278836#comment-14278836
 ] 

Hive QA commented on HIVE-9367:
---



{color:red}Overall{color}: -1 at least one test failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12692326/HIVE-9367.2.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 7311 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_stats_counter
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_union13
org.apache.hadoop.hive.ql.TestMTQueries.testMTQueries1
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2373/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2373/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2373/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12692326 - PreCommit-HIVE-TRUNK-Build

 CombineFileInputFormatShim#getDirIndices is expensive
 -

 Key: HIVE-9367
 URL: https://issues.apache.org/jira/browse/HIVE-9367
 Project: Hive
  Issue Type: Improvement
Reporter: Jimmy Xiang
Assignee: Jimmy Xiang
 Attachments: HIVE-9367.1.patch, HIVE-9367.2.patch


 [~lirui] found out that we spent quite some time on 
 CombineFileInputFormatShim#getDirIndices. Looked into it and it seems to me 
 we should be able to get rid of this method completely if we can enhance 
 CombineFileInputFormatShim a little.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-15 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279798#comment-14279798
 ] 

Xuefu Zhang commented on HIVE-9367:
---

[~jxiang], are the failures related to your patch?



[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-15 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279803#comment-14279803
 ] 

Jimmy Xiang commented on HIVE-9367:
---

Looked into them, not related.



[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-14 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277679#comment-14277679
 ] 

Jimmy Xiang commented on HIVE-9367:
---

Sure, will remove it in next patch.



[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277642#comment-14277642
 ] 

Xuefu Zhang commented on HIVE-9367:
---

Thanks for the explanation. This is a shim class, so we are okay. Patch looks 
good to me. One note, though: the prune() method no longer seems to be needed. 
Could you remove it? 



[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277931#comment-14277931
 ] 

Xuefu Zhang commented on HIVE-9367:
---

+1 pending on test



[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277070#comment-14277070
 ] 

Xuefu Zhang commented on HIVE-9367:
---

Nice improvement. However, I'm a little concerned about overriding the listStatus() 
method, as any caller (including subclasses) would suddenly get a list with 
folders excluded. I'm wondering if it's possible to achieve the same 
optimization w/o overriding that method.
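
The concern above can be illustrated with a small plain-Java sketch. It does not use the Hadoop API: Entry, BaseInputFormat, FilteringShim and countEntries() are hypothetical stand-ins invented for illustration, and the actual patch may look quite different. It only shows that once a listing method is overridden to drop folders, every caller that goes through that method sees the filtered list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListStatusOverrideSketch {
    // Hypothetical stand-in for a file status entry; not the Hadoop FileStatus class.
    static class Entry {
        final String path;
        final boolean isDir;

        Entry(String path, boolean isDir) {
            this.path = path;
            this.isDir = isDir;
        }
    }

    // Hypothetical base class whose listStatus() returns files and folders alike.
    static class BaseInputFormat {
        protected List<Entry> listStatus() {
            return Arrays.asList(
                new Entry("/warehouse/t/part-0", false),
                new Entry("/warehouse/t/sub", true),
                new Entry("/warehouse/t/part-1", false));
        }

        // A caller that relies on listStatus(); it sees whatever an override returns.
        int countEntries() {
            return listStatus().size();
        }
    }

    // Hypothetical shim that overrides listStatus() to exclude folders.
    static class FilteringShim extends BaseInputFormat {
        @Override
        protected List<Entry> listStatus() {
            List<Entry> filesOnly = new ArrayList<>();
            for (Entry e : super.listStatus()) {
                if (!e.isDir) {
                    filesOnly.add(e);
                }
            }
            return filesOnly;
        }
    }

    public static void main(String[] args) {
        System.out.println("base sees " + new BaseInputFormat().countEntries() + " entries");
        System.out.println("shim sees " + new FilteringShim().countEntries() + " entries");
    }
}
```

In this model the inherited countEntries() silently changes its result under the shim, which is the kind of side effect the comment is asking about.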



[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-14 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277208#comment-14277208
 ] 

Jimmy Xiang commented on HIVE-9367:
---

So far, I haven't found such a subclass/caller. Without overriding that method, we 
may need to enhance the MR code a little, for example by adding a new API/setting, 
which is not practical. For now, overriding the method is probably the best we can 
do. Thanks.



[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-13 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276414#comment-14276414
 ] 

Rui Li commented on HIVE-9367:
--

Hi [~jxiang], could you elaborate a little on how this will avoid the expensive 
calls? It seems we still have to iterate over all the file statuses to check 
whether each one is a directory?



[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-13 Thread Jimmy Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276435#comment-14276435
 ] 

Jimmy Xiang commented on HIVE-9367:
---

With the FileStatus, we don't need to go to the NN (NameNode) to get the 
FileStatus again, since FileStatus already records whether the path is a file 
or a dir. Originally, getDirIndices fetched the FileStatus again, which is an 
extra call for each file. So this patch saves us one FileStatus call per file.
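
The saving described above can be sketched in plain Java. This is not the patch and does not use the Hadoop API: Status, FakeNameNode, sample() and both dirIndices* methods are hypothetical stand-ins, and the lookup counter only models the assumption that each per-path status fetch costs one round trip. The sketch contrasts re-fetching a status per path with reusing the statuses already returned by the listing.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DirCheckSketch {
    // Hypothetical stand-in for a file status; not the Hadoop FileStatus class.
    static class Status {
        final String path;
        final boolean isDir;

        Status(String path, boolean isDir) {
            this.path = path;
            this.isDir = isDir;
        }
    }

    // Hypothetical stand-in for the NameNode that counts per-path status lookups.
    static class FakeNameNode {
        private final Map<String, Status> entries = new LinkedHashMap<>();
        int statusLookups = 0;

        void add(String path, boolean isDir) {
            entries.put(path, new Status(path, isDir));
        }

        // One listing call returns a full status for every entry.
        List<Status> listStatus() {
            return new ArrayList<>(entries.values());
        }

        // The same listing, but keeping only the paths.
        List<String> listPaths() {
            return new ArrayList<>(entries.keySet());
        }

        // Modeled as one extra round trip per path.
        Status getFileStatus(String path) {
            statusLookups++;
            return entries.get(path);
        }
    }

    static FakeNameNode sample() {
        FakeNameNode nn = new FakeNameNode();
        nn.add("/t/part-0", false);
        nn.add("/t/sub", true);
        nn.add("/t/part-1", false);
        return nn;
    }

    // Before: only paths are kept, so every path is looked up again.
    static List<Integer> dirIndicesByLookup(FakeNameNode nn, List<String> paths) {
        List<Integer> dirs = new ArrayList<>();
        for (int i = 0; i < paths.size(); i++) {
            if (nn.getFileStatus(paths.get(i)).isDir) {
                dirs.add(i);
            }
        }
        return dirs;
    }

    // After: reuse the statuses from the listing; no further lookups are made.
    static List<Integer> dirIndicesFromStatuses(List<Status> statuses) {
        List<Integer> dirs = new ArrayList<>();
        for (int i = 0; i < statuses.size(); i++) {
            if (statuses.get(i).isDir) {
                dirs.add(i);
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        FakeNameNode a = sample();
        List<Integer> before = dirIndicesByLookup(a, a.listPaths());
        System.out.println("by lookup: dirs=" + before + ", extra lookups=" + a.statusLookups);

        FakeNameNode b = sample();
        List<Integer> after = dirIndicesFromStatuses(b.listStatus());
        System.out.println("from statuses: dirs=" + after + ", extra lookups=" + b.statusLookups);
    }
}
```

Both variants still iterate over every entry; the difference in this model is only the number of extra per-path lookups, which grows with the file count in the first variant and stays at zero in the second.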



[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-13 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276444#comment-14276444
 ] 

Rui Li commented on HIVE-9367:
--

I see. Thanks [~jxiang] for the explanation!



[jira] [Commented] (HIVE-9367) CombineFileInputFormatShim#getDirIndices is expensive

2015-01-13 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276527#comment-14276527
 ] 

Rui Li commented on HIVE-9367:
--

I just verified the patch here can reduce the getSplits time from 1s to less 
than 200ms. The test table consists of one 100GB sequence file.
