[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-31 Thread Prasanth J (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888013#comment-13888013
 ] 

Prasanth J commented on HIVE-6287:
--

The test failures seem to be unrelated.

> batchSize computation in Vectorized ORC reader can cause 
> BufferUnderFlowException when PPD is enabled
> -
>
>                 Key: HIVE-6287
>                 URL: https://issues.apache.org/jira/browse/HIVE-6287
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile, vectorization
>         Attachments: HIVE-6287.1.patch, HIVE-6287.2.patch, HIVE-6287.3.patch, 
> HIVE-6287.3.patch, HIVE-6287.4.patch, HIVE-6287.WIP.patch
>
>
> The nextBatch() method that computes the batchSize is only aware of stripe 
> boundaries. This does not work when predicate pushdown (PPD) in ORC is 
> enabled, because PPD works at the row group level (a stripe contains multiple 
> row groups). By default, the row group stride is 10000 rows. When PPD is 
> enabled, some row groups may get eliminated. After row group elimination, 
> disk ranges are computed based on the selected row groups. If the batchSize 
> computation is not aware of this, it leads to BufferUnderFlowException 
> (reading beyond the disk range). The following scenario illustrates it:
> {code}
> |---------------------------------- STRIPE 1 -----------------------------------|
> |-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
> |--------- diskrange 1 ---------|                               |- diskrange 2 -|
>                                 ^
>                              (marker)
> {code}
> diskrange 1 will have 20000 rows and diskrange 2 will have 10000 rows. Since 
> nextBatch() was not aware of row groups, and hence of the disk ranges, it 
> tries to read 1024 values at the end of diskrange 1 where it should read only 
> 20000 % 1024 = 544 values. This results in BufferUnderFlowException.
> To fix this, a marker is placed at the end of each range and batchSize is 
> computed accordingly:
> {code}
> batchSize = Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));
> {code}
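
The marker-based computation described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the code from the attached patches: the class BatchSizeSketch, the helpers markerPosition() and batchSizes(), and the selected[] array are hypothetical names, the stride of 10000 rows is the ORC default, and the example assumes two selected row groups followed by an eliminated one. Only the Math.min(...) expression is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

public final class BatchSizeSketch {
    // Same value as VectorizedRowBatch.DEFAULT_SIZE referenced in the description.
    public static final int DEFAULT_SIZE = 1024;

    /**
     * Hypothetical helper: returns the row offset (within the stripe) just past
     * the contiguous run of selected row groups that contains rowInStripe.
     * Assumes rowInStripe lies inside a selected row group.
     */
    public static long markerPosition(boolean[] selected, long stride,
                                      long rowsInStripe, long rowInStripe) {
        int group = (int) (rowInStripe / stride);
        while (group < selected.length && selected[group]) {
            group++;
        }
        // The last row group of a stripe may hold fewer than 'stride' rows.
        return Math.min(group * stride, rowsInStripe);
    }

    /** The expression from the description: never read past the marker. */
    public static long batchSize(long rowInStripe, long markerPosition) {
        return Math.min(DEFAULT_SIZE, markerPosition - rowInStripe);
    }

    /** Lists the batch sizes produced while reading from rowInStripe up to the marker. */
    public static List<Long> batchSizes(long rowInStripe, long markerPosition) {
        List<Long> sizes = new ArrayList<>();
        while (rowInStripe < markerPosition) {
            long size = batchSize(rowInStripe, markerPosition);
            sizes.add(size);
            rowInStripe += size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        // Assumed layout: row groups 1-2 selected, 3-4 eliminated, 5 selected.
        boolean[] selected = {true, true, false, false, true};
        long marker = markerPosition(selected, 10000, 50000, 0);
        List<Long> sizes = batchSizes(0, marker);
        // Prints: marker=20000 batches=20 last=544
        System.out.println("marker=" + marker + " batches=" + sizes.size()
                + " last=" + sizes.get(sizes.size() - 1));
    }
}
```

With a marker at row 20000, the first 19 batches hold 1024 rows each (19456 rows in total) and the final batch is capped at 544 rows, so the reader would stop at the end of the disk range instead of requesting a full 1024-row batch.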



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-31 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887689#comment-13887689
 ] 

Hive QA commented on HIVE-6287:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12626157/HIVE-6287.4.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 4981 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucket_num_reducers
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1126/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1126/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12626157



[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886446#comment-13886446
 ] 

Hive QA commented on HIVE-6287:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12625925/HIVE-6287.3.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 4973 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_vectorization_ppd
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1109/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1109/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12625925



[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-28 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884895#comment-13884895
 ] 

Gunther Hagleitner commented on HIVE-6287:
--

Assuming tests are passing: +1 LGTM



[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-25 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882099#comment-13882099
 ] 

Hive QA commented on HIVE-6287:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12625092/HIVE-6287.2.patch

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 4959 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_vectorization_ppd
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_import_exported_table
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_reducers_power_two
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_load_hdfs_file_with_space_in_the_name
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testNegativeCliDriver_file_with_header_footer_negative
{noformat}

Test results: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1025/testReport
Console output: 
http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1025/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12625092



[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled

2014-01-24 Thread Eric Hanson (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881225#comment-13881225
 ] 

Eric Hanson commented on HIVE-6287:
---

I think that by PPD you mean predicate pushdown. This was not immediately 
obvious to me. I edited it into the description. It's a good idea to define 
acronyms on first use. Thanks!
