[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451193#comment-15451193 ]

Lefty Leverenz commented on HIVE-14233:
---------------------------------------

Doc note: This adds the configuration parameter *hive.transactional.events.mem* to HiveConf.java in release 2.2.0, so it will need to be documented in the wiki.

* [Configuration Properties -- Transactions and Compactor | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-TransactionsandCompactor]
* [Hive Transactions -- New Configuration Parameters for Transactions | https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-NewConfigurationParametersforTransactions]

Added a TODOC2.2 label.

> Improve vectorization for ACID by eliminating row-by-row stitching
> ------------------------------------------------------------------
>
>         Key: HIVE-14233
>         URL: https://issues.apache.org/jira/browse/HIVE-14233
>     Project: Hive
>  Issue Type: New Feature
>  Components: Transactions, Vectorization
>    Reporter: Saket Saurabh
>    Assignee: Saket Saurabh
>      Labels: TODOC2.2
>     Fix For: 2.2.0
>
> Attachments: HIVE-14233.01.patch, HIVE-14233.02.patch,
> HIVE-14233.03.patch, HIVE-14233.04.patch, HIVE-14233.05.patch,
> HIVE-14233.06.patch, HIVE-14233.07.patch, HIVE-14233.08.patch,
> HIVE-14233.09.patch, HIVE-14233.10.patch, HIVE-14233.11.patch,
> HIVE-14233.12.patch
>
>
> This JIRA proposes to improve vectorization for ACID by eliminating
> row-by-row stitching when reading back ACID files. In the current
> implementation, a vectorized row batch is created by populating the batch one
> row at a time, before the vectorized batch is passed up along the operator
> pipeline. This row-by-row stitching limitation was because of the fact that
> the ACID insert/update/delete events from various delta files needed to be
> merged together before the actual version of a given row was found out.
> HIVE-14035 has enabled us to break away from that limitation by splitting
> ACID update events into a combination of delete+insert. In fact, it has now
> enabled us to create splits on delta files.
> Building on top of HIVE-14035, this JIRA proposes to solve this earlier
> bottleneck in the vectorized code path for ACID by now directly reading row
> batches from the underlying ORC files and avoiding any stitching altogether.
> Once a row batch is read from the split (which may be on a base/delta file),
> the deleted rows will be found by cross-referencing them against a data
> structure that will just keep track of deleted events (found in the
> deleted_delta files). This will lead to a large performance gain when reading
> ACID files in vectorized fashion, while enabling further optimizations in
> future that can be done on top of that.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
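For readers of the archive, the read path described in the issue text above (read a whole row batch, then drop rows that appear in a structure of delete events) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the patch: the class names `DeleteEventFilterSketch`, `RowKey` and `Batch` are hypothetical, a row is assumed to be identified by an (original transaction, bucket, row id) triple, and a plain hash set stands in for whatever delete-event structure the patch actually builds from the deleted_delta files.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class DeleteEventFilterSketch {

    /** Hypothetical row identifier (assumption: transaction, bucket, row id). */
    static final class RowKey {
        final long originalTransaction;
        final int bucket;
        final long rowId;

        RowKey(long originalTransaction, int bucket, long rowId) {
            this.originalTransaction = originalTransaction;
            this.bucket = bucket;
            this.rowId = rowId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof RowKey)) return false;
            RowKey k = (RowKey) o;
            return originalTransaction == k.originalTransaction
                    && bucket == k.bucket
                    && rowId == k.rowId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(originalTransaction, bucket, rowId);
        }
    }

    /** Hypothetical stand-in for a vectorized batch: id columns plus a selection vector. */
    static final class Batch {
        final long[] originalTransaction;
        final int[] bucket;
        final long[] rowId;
        final int[] selected;   // indexes of rows that are still visible
        int selectedCount;

        Batch(long[] originalTransaction, int[] bucket, long[] rowId) {
            this.originalTransaction = originalTransaction;
            this.bucket = bucket;
            this.rowId = rowId;
            this.selected = new int[rowId.length];
            for (int i = 0; i < rowId.length; i++) {
                selected[i] = i;  // initially every row in the batch is selected
            }
            this.selectedCount = rowId.length;
        }
    }

    /** Compacts the selection vector in place, dropping rows found in the delete set. */
    static int applyDeletes(Batch batch, Set<RowKey> deleted) {
        int kept = 0;
        for (int j = 0; j < batch.selectedCount; j++) {
            int row = batch.selected[j];
            RowKey key = new RowKey(
                    batch.originalTransaction[row], batch.bucket[row], batch.rowId[row]);
            if (!deleted.contains(key)) {
                batch.selected[kept++] = row;  // kept <= j, so this never overwrites unread entries
            }
        }
        batch.selectedCount = kept;
        return kept;
    }

    public static void main(String[] args) {
        Batch batch = new Batch(new long[] {1, 1, 2}, new int[] {0, 0, 0}, new long[] {0, 1, 0});
        Set<RowKey> deleted = new HashSet<>();
        deleted.add(new RowKey(1, 0, 1));  // one delete event
        int kept = applyDeletes(batch, deleted);
        // prints: 2 rows kept: [0, 2]
        System.out.println(kept + " rows kept: "
                + Arrays.toString(Arrays.copyOf(batch.selected, kept)));
    }
}
```

The point of the sketch is that the column data is never copied row by row; only the selection vector changes, which is what makes skipping the stitching step attractive. How many delete events can be held in memory at once is presumably what a limit such as *hive.transactional.events.mem* is meant to bound, though that reading is an assumption here.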
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451116#comment-15451116 ]

Saket Saurabh commented on HIVE-14233:
--------------------------------------

Thanks [~ekoifman] for committing the patch.
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451113#comment-15451113 ]

Saket Saurabh commented on HIVE-14233:
--------------------------------------

Thanks [~ekoifman], and [~sershe] for reviewing the patch.
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450819#comment-15450819 ]

Eugene Koifman commented on HIVE-14233:
---------------------------------------

+1 patch 12
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450642#comment-15450642 ]

Hive QA commented on HIVE-14233:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12826257/HIVE-14233.12.patch

{color:green}SUCCESS:{color} +1 due to 3 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 10502 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.org.apache.hadoop.hive.cli.TestCliDriver
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
org.apache.hadoop.hive.ql.TestMTQueries.testMTQueries1
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1047/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1047/console
Test logs: http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-1047/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.
ATTACHMENT ID: 12826257 - PreCommit-HIVE-MASTER-Build
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450242#comment-15450242 ]

Eugene Koifman commented on HIVE-14233:
---------------------------------------

[~saketj], I left 1 last comment on RB but it's a nit
+1 pending tests
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450241#comment-15450241 ]

Eugene Koifman commented on HIVE-14233:
---------------------------------------

[~saketj], I left 1 last comment on RB but it's a nit
+1 pending tests
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447222#comment-15447222 ]

Eugene Koifman commented on HIVE-14233:
---------------------------------------

Added some more comments on RB - mostly nits but not all
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446687#comment-15446687 ]

Hive QA commented on HIVE-14233:
--------------------------------

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12826032/HIVE-14233.10.patch

{color:green}SUCCESS:{color} +1 due to 3 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 10496 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.org.apache.hadoop.hive.cli.TestCliDriver
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1033/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1033/console
Test logs: http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-1033/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.
ATTACHMENT ID: 12826032 - PreCommit-HIVE-MASTER-Build
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446668#comment-15446668 ]

Saket Saurabh commented on HIVE-14233:
--------------------------------------

Oops, forgot to do that.. sure Eugene, done that now.
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446640#comment-15446640 ]

Eugene Koifman commented on HIVE-14233:
---------------------------------------

could you upload the latest patch to RB?
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436236#comment-15436236 ]

Saket Saurabh commented on HIVE-14233:
--------------------------------------

Thanks [~ekoifman] for the comments, working now on fixing them.
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436113#comment-15436113 ] Eugene Koifman commented on HIVE-14233: --- [~saketj] more comments on RB > Improve vectorization for ACID by eliminating row-by-row stitching > -- > > Key: HIVE-14233 > URL: https://issues.apache.org/jira/browse/HIVE-14233 > Project: Hive > Issue Type: New Feature > Components: Transactions, Vectorization >Reporter: Saket Saurabh >Assignee: Saket Saurabh > Attachments: HIVE-14233.01.patch, HIVE-14233.02.patch, > HIVE-14233.03.patch, HIVE-14233.04.patch, HIVE-14233.05.patch, > HIVE-14233.06.patch, HIVE-14233.07.patch, HIVE-14233.08.patch, > HIVE-14233.09.patch > > > This JIRA proposes to improve vectorization for ACID by eliminating > row-by-row stitching when reading back ACID files. In the current > implementation, a vectorized row batch is created by populating the batch one > row at a time, before the vectorized batch is passed up along the operator > pipeline. This row-by-row stitching limitation was because of the fact that > the ACID insert/update/delete events from various delta files needed to be > merged together before the actual version of a given row was found out. > HIVE-14035 has enabled us to break away from that limitation by splitting > ACID update events into a combination of delete+insert. In fact, it has now > enabled us to create splits on delta files. > Building on top of HIVE-14035, this JIRA proposes to solve this earlier > bottleneck in the vectorized code path for ACID by now directly reading row > batches from the underlying ORC files and avoiding any stitching altogether. > Once a row batch is read from the split (which may be on a base/delta file), > the deleted rows will be found by cross-referencing them against a data > structure that will just keep track of deleted events (found in the > deleted_delta files). 
This will lead to a large performance gain when reading > ACID files in vectorized fashion, while enabling further optimizations in > future that can be done on top of that. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
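A minimal sketch of the read path described in the issue text above: read a whole batch, then drop rows whose identifiers appear in a structure holding only delete events. This is an illustration under stated assumptions, not code from the patch. The class `DeleteFilterSketch`, the `RowKey` triple, and `selectLiveRows` are hypothetical names, and a plain `HashSet` stands in for whatever structure the patch uses to track the contents of the deleted_delta files.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch; none of these names come from the HIVE-14233 patch.
final class DeleteFilterSketch {

    // Illustrative row identifier: assumed to be (transaction id, bucket, row id).
    static final class RowKey {
        final long transactionId;
        final int bucket;
        final long rowId;

        RowKey(long transactionId, int bucket, long rowId) {
            this.transactionId = transactionId;
            this.bucket = bucket;
            this.rowId = rowId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof RowKey)) return false;
            RowKey other = (RowKey) o;
            return transactionId == other.transactionId
                && bucket == other.bucket
                && rowId == other.rowId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(transactionId, bucket, rowId);
        }
    }

    // Returns the positions in the batch whose keys have no matching delete event.
    // The batch is never rebuilt row by row; only a selection of positions is produced.
    static int[] selectLiveRows(List<RowKey> batch, Set<RowKey> deleted) {
        int[] selected = new int[batch.size()];
        int count = 0;
        for (int i = 0; i < batch.size(); i++) {
            if (!deleted.contains(batch.get(i))) {
                selected[count++] = i;
            }
        }
        return Arrays.copyOf(selected, count);
    }
}
```

Returning positions rather than copying rows mirrors the idea of leaving the batch as read from the ORC file and only marking which rows survive. Whether the patch represents this as a selection vector is not stated in the thread.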
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431047#comment-15431047 ]

Hive QA commented on HIVE-14233:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12824840/HIVE-14233.09.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 8 failed/errored test(s), 10465 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[transform_ppr1]
org.apache.hive.hcatalog.api.repl.commands.TestCommands.org.apache.hive.hcatalog.api.repl.commands.TestCommands
org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler
org.apache.hive.service.cli.operation.TestOperationLoggingLayout.testSwitchLogLayout
{noformat}

Test results: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/952/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/952/console
Test logs: http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-952/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 8 tests failed
{noformat}

This message is automatically generated.
ATTACHMENT ID: 12824840 - PreCommit-HIVE-MASTER-Build
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419809#comment-15419809 ] Saket Saurabh commented on HIVE-14233: -- It is to be noted that this patch for improved vectorization process does not handle the case when the split is on an original file (a non-acid schema file). In such cases, it resorts to the older strategy of creating vectorized row batches using row-by-row stitching. However, this performance roadblock will happen only for the non-ACID to ACID converted tables and even then will only exist till the first major compaction on the table produces a base file.
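The fallback in Saket Saurabh's comment above amounts to a per-split decision, sketched here under assumptions. The enum and method names are hypothetical and do not come from the patch. The only rule taken from the thread is that a split on an original (non-ACID schema) file uses the older row-by-row stitching, while base and delta files can be read as whole batches.

```java
// Hypothetical sketch of the per-split decision; names are illustrative only.
final class ReaderChoiceSketch {

    // Kind of file a split points at, as discussed in the thread.
    enum FileKind { ORIGINAL, BASE, DELTA }

    // The two ways of producing a vectorized row batch mentioned in the thread.
    enum ReadStrategy { DIRECT_BATCH_READ, ROW_BY_ROW_STITCHING }

    // Original files keep the older strategy; base and delta files take the direct path.
    static ReadStrategy chooseStrategy(FileKind kind) {
        return kind == FileKind.ORIGINAL
            ? ReadStrategy.ROW_BY_ROW_STITCHING
            : ReadStrategy.DIRECT_BATCH_READ;
    }
}
```

Per the comment, the slow branch only applies to tables converted from non-ACID to ACID, and only until the first major compaction produces a base file.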
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414488#comment-15414488 ] Sergey Shelukhin commented on HIVE-14233: - Some comments on RB
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414224#comment-15414224 ] Saket Saurabh commented on HIVE-14233: -- Thanks [~sershe] for pointing that out. Have attached the link to review board for this JIRA. https://reviews.apache.org/r/50934/
[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching
[ https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408541#comment-15408541 ] Sergey Shelukhin commented on HIVE-14233: - Is it possible to post an RB? An update to RB (or the initial post if done via rb tool, I assume) allows one to have a base patch (HIVE-14035 patch in this case, I assume)...