[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13888013#comment-13888013 ]

Prasanth J commented on HIVE-6287:
----------------------------------

The test failures seem to be unrelated.

> batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
> -----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-6287
>                 URL: https://issues.apache.org/jira/browse/HIVE-6287
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 0.13.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile, vectorization
>         Attachments: HIVE-6287.1.patch, HIVE-6287.2.patch, HIVE-6287.3.patch, HIVE-6287.3.patch, HIVE-6287.4.patch, HIVE-6287.WIP.patch
>
>
> The nextBatch() method that computes the batchSize is only aware of stripe
> boundaries. This does not work when predicate pushdown (PPD) in ORC is
> enabled, because PPD works at the row group level (a stripe contains multiple
> row groups). By default, the row group stride is 10000. When PPD is enabled,
> some row groups may get eliminated. After row group elimination, disk ranges
> are computed based on the selected row groups. If the batchSize computation is
> not aware of this, it leads to a BufferUnderFlowException (reading beyond the
> disk range). The following scenario should illustrate it more clearly:
> {code}
> |--------------------------------- STRIPE 1 ------------------------------------|
> |-- row grp 1 --|-- row grp 2 --|-- row grp 3 --|-- row grp 4 --|-- row grp 5 --|
> |--------- diskrange 1 ---------|               |- diskrange 2 -|
>                                 ^
>                              (marker)
> {code}
> diskrange 1 will have 20000 rows and diskrange 2 will have 10000 rows. Since
> nextBatch() was not aware of row groups, and hence of the disk ranges, it tries
> to read 1024 values at the end of diskrange 1, where it should only read
> 20000 % 1024 = 544 values. This results in a BufferUnderFlowException.
> To fix this, a marker is placed at the end of each range and batchSize is
> computed accordingly:
> {code}
> batchSize = Math.min(VectorizedRowBatch.DEFAULT_SIZE, (markerPosition - rowInStripe));
> {code}

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
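The marker-based computation in the description can be tried out in isolation. The following is only a sketch, not the code from the attached patches: the class name BatchSizeSketch and the helpers computeBatchSize and batchSizes are hypothetical, the constant 1024 stands in for VectorizedRowBatch.DEFAULT_SIZE, and it assumes diskrange 1 holds 20000 rows (two row groups at the default stride of 10000), which is consistent with the remainder of 544 quoted in the description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not taken from the HIVE-6287 patches.
final class BatchSizeSketch {
    // Stand-in for VectorizedRowBatch.DEFAULT_SIZE (1024 in the issue text).
    static final int DEFAULT_SIZE = 1024;

    /** Rows the next batch may read without crossing the marker. */
    static long computeBatchSize(long rowInStripe, long markerPosition) {
        return Math.min(DEFAULT_SIZE, markerPosition - rowInStripe);
    }

    /** Batch sizes produced while reading rows [start, markerPosition). */
    static List<Long> batchSizes(long start, long markerPosition) {
        List<Long> sizes = new ArrayList<>();
        long rowInStripe = start;
        while (rowInStripe < markerPosition) {
            long batchSize = computeBatchSize(rowInStripe, markerPosition);
            sizes.add(batchSize);
            rowInStripe += batchSize; // advance by what was actually read
        }
        return sizes;
    }

    public static void main(String[] args) {
        // diskrange 1 is assumed to cover rows 0..19999 of the stripe.
        List<Long> sizes = batchSizes(0, 20000);
        System.out.println("batches: " + sizes.size());
        System.out.println("last batch: " + sizes.get(sizes.size() - 1));
    }
}
```

Under these assumptions the loop issues 19 full batches of 1024 rows and a final batch of 544, so no read extends past the marker at the end of diskrange 1. A reader that ignores the marker would request 1024 rows for that last batch, which is the overrun the issue describes.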
[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887689#comment-13887689 ]

Hive QA commented on HIVE-6287:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12626157/HIVE-6287.4.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 4981 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucket_num_reducers
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1126/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1126/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12626157
[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13886446#comment-13886446 ]

Hive QA commented on HIVE-6287:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12625925/HIVE-6287.3.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 4973 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_vectorization_ppd
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1109/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1109/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12625925
[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13884895#comment-13884895 ]

Gunther Hagleitner commented on HIVE-6287:
------------------------------------------

Assuming tests are passing: +1 LGTM
[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13882099#comment-13882099 ]

Hive QA commented on HIVE-6287:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12625092/HIVE-6287.2.patch

{color:red}ERROR:{color} -1 due to 5 failed/errored test(s), 4959 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_vectorization_ppd
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_import_exported_table
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_reducers_power_two
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_load_hdfs_file_with_space_in_the_name
org.apache.hadoop.hive.cli.TestNegativeMinimrCliDriver.testNegativeCliDriver_file_with_header_footer_negative
{noformat}

Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1025/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1025/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 5 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12625092
[jira] [Commented] (HIVE-6287) batchSize computation in Vectorized ORC reader can cause BufferUnderFlowException when PPD is enabled
[ https://issues.apache.org/jira/browse/HIVE-6287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13881225#comment-13881225 ]

Eric Hanson commented on HIVE-6287:
-----------------------------------

I think that by PPD you mean predicate pushdown. This was not immediately obvious to me. I edited it into the description. It's a good idea to define acronyms on first use. Thanks!