[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-28 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604902#comment-14604902
 ] 

Gopal V commented on HIVE-11043:


[~leftylev]: yes, it needs doc - I will write up a "decision" tree of the 
hybrid strategy for the docs.

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch, HIVE-11043.2.patch, 
> HIVE-11043.3.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-28 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604841#comment-14604841
 ] 

Lefty Leverenz commented on HIVE-11043:
---

Does this need documentation?

Also, shouldn't Fix Version include 1.3.0 (commit 
64f8e0f069f71f82518a9280d199f790174bee33 to branch-1)?

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch, HIVE-11043.2.patch, 
> HIVE-11043.3.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-25 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600875#comment-14600875
 ] 

Hive QA commented on HIVE-11043:




{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12741727/HIVE-11043.3.patch

{color:red}ERROR:{color} -1 due to 1 failed/errored test(s), 9025 tests executed
*Failed tests:*
{noformat}
org.apache.hive.hcatalog.streaming.TestStreaming.testAddPartition
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4377/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4377/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4377/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 1 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12741727 - PreCommit-HIVE-TRUNK-Build

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch, HIVE-11043.2.patch, 
> HIVE-11043.3.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-24 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600320#comment-14600320
 ] 

Gopal V commented on HIVE-11043:


Filed HIVE-11102 to track the issue - this error is not related to this patch, 
but has been exposed by this patch (i.e picking ETL Strategy instead of BI).

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch, HIVE-11043.2.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-23 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598893#comment-14598893
 ] 

Gopal V commented on HIVE-11043:


[~prasanth_j]: sure, looks like errors when reading footers for the 1 file/1 
split case.

The error is actually

{code}
Caused by: java.lang.IndexOutOfBoundsException: Index: 0
at java.util.Collections$EmptyList.get(Collections.java:3212)
at 
org.apache.hadoop.hive.ql.io.orc.OrcProto$Type.getSubtypes(OrcProto.java:12240)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getColumnIndicesFromNames(ReaderImpl.java:651)
at 
org.apache.hadoop.hive.ql.io.orc.ReaderImpl.getRawDataSizeOfColumns(ReaderImpl.java:634)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.populateAndCacheStripeDetails(OrcInputFormat.java:938)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:847)
at 
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$SplitGenerator.call(OrcInputFormat.java:713)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{code}

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch, HIVE-11043.2.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-23 Thread Pengcheng Xiong (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14598799#comment-14598799
 ] 

Pengcheng Xiong commented on HIVE-11043:


[~prasanth_j] and [~gopalv], [~jpullokkaran] asked me to track the recent 
constant test cases failing on master and I came here. It seems that this patch 
causes the problem. At the first sight, authorization_delete.q sounds 
unrelated. However, it includes creating a table stored as ORC. If I revert 
this patch, the test cases can pass. Could you guys take a look? Thanks.

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch, HIVE-11043.2.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-23 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597295#comment-14597295
 ] 

Prasanth Jayachandran commented on HIVE-11043:
--

LGTM, +1. I don't think the test failures are related.

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch, HIVE-11043.2.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-23 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597289#comment-14597289
 ] 

Hive QA commented on HIVE-11043:




{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12741166/HIVE-11043.2.patch

{color:red}ERROR:{color} -1 due to 7 failed/errored test(s), 9014 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_delete_own_table
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_authorization_update_own_table
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_outer_join1
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_outer_join4
org.apache.hive.hcatalog.pig.TestHCatStorer.testEmptyStore[3]
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4345/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4345/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4345/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 7 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12741166 - PreCommit-HIVE-TRUNK-Build

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch, HIVE-11043.2.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-22 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14596843#comment-14596843
 ] 

Gopal V commented on HIVE-11043:


bq. 3) ... In which case we will end up using BI as default even though there 
are only small number of files.
bq. 5) Should we make this independently configurable? Instead of using the 
cache max size.

The max cache size is a safety limit for huge clusters, it is not a 
configuration requirement.

If you need to change the behaviour explicitly, the right config to change is 
the strategy used (between ETL/BI) to select whichever one's the preferred one.

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-19 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593575#comment-14593575
 ] 

Prasanth Jayachandran commented on HIVE-11043:
--

Mostly looks good. 
Few questions/comments:
1) Can we use the same default for numSplits as MR? 1 instead of -1. This will 
make ETL strategy the default even in the presence of single small file.
{code}
return generateSplitsInfo(conf, -1);
{code}
2) The condition should be numFiles <= context.minSplits right? This will avoid 
choosing BI in the case of 1 small file.
3) I tried some queries and numSplits arg in getSplits() can become 0. In which 
case we will end up using BI as default even though there are only small number 
of files.
4) Some more tests for these corner cases will be helpful.
5) Should we make this independently configurable? Instead of using the cache 
max size.

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-11043) ORC split strategies should adapt based on number of files

2015-06-18 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-11043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593104#comment-14593104
 ] 

Hive QA commented on HIVE-11043:




{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12740561/HIVE-11043.1.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 9010 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_vector_outer_join4
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join28
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4317/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/4317/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-4317/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12740561 - PreCommit-HIVE-TRUNK-Build

> ORC split strategies should adapt based on number of files
> --
>
> Key: HIVE-11043
> URL: https://issues.apache.org/jira/browse/HIVE-11043
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Prasanth Jayachandran
>Assignee: Gopal V
> Fix For: 2.0.0
>
> Attachments: HIVE-11043.1.patch
>
>
> ORC split strategies added in HIVE-10114 chose strategies based on average 
> file size. It would be beneficial to choose a different strategy based on 
> number of files as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)