[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-30 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451193#comment-15451193
 ] 

Lefty Leverenz commented on HIVE-14233:
---

Doc note:  This adds the configuration parameter 
*hive.transactional.events.mem* to HiveConf.java in release 2.2.0, so it will 
need to be documented in the wiki.

* [Configuration Properties -- Transactions and Compactor | 
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-TransactionsandCompactor]
* [Hive Transactions -- New Configuration Parameters for Transactions | 
https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-NewConfigurationParametersforTransactions]

Added a TODOC2.2 label.

> Improve vectorization for ACID by eliminating row-by-row stitching
> --
>
> Key: HIVE-14233
> URL: https://issues.apache.org/jira/browse/HIVE-14233
> Project: Hive
> Issue Type: New Feature
> Components: Transactions, Vectorization
> Reporter: Saket Saurabh
> Assignee: Saket Saurabh
> Labels: TODOC2.2
> Fix For: 2.2.0
>
> Attachments: HIVE-14233.01.patch, HIVE-14233.02.patch, 
> HIVE-14233.03.patch, HIVE-14233.04.patch, HIVE-14233.05.patch, 
> HIVE-14233.06.patch, HIVE-14233.07.patch, HIVE-14233.08.patch, 
> HIVE-14233.09.patch, HIVE-14233.10.patch, HIVE-14233.11.patch, 
> HIVE-14233.12.patch
>
>
> This JIRA proposes to improve vectorization for ACID by eliminating 
> row-by-row stitching when reading back ACID files. In the current 
> implementation, a vectorized row batch is populated one row at a time 
> before it is passed up the operator pipeline. Row-by-row stitching was 
> needed because the ACID insert/update/delete events from the various delta 
> files had to be merged before the current version of a given row could be 
> determined. HIVE-14035 removed that limitation by splitting each ACID 
> update event into a delete plus an insert, which also makes it possible to 
> create splits on delta files.
> Building on HIVE-14035, this JIRA proposes to remove this bottleneck in the 
> vectorized code path for ACID by reading row batches directly from the 
> underlying ORC files and avoiding stitching altogether. Once a row batch is 
> read from the split (which may be on a base or delta file), the deleted 
> rows are identified by cross-referencing the batch against a data structure 
> that tracks only delete events (found in the delete_delta files). This is 
> expected to yield a large performance gain when reading ACID files in 
> vectorized fashion, and it enables further optimizations on top of it in 
> the future.
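
The read path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Hive's actual reader: the class `DeleteFilterSketch`, the `RowKey` type (assumed here to be an original transaction, bucket, row id triple), and the method `selectLiveRows` are hypothetical names, and a plain in-memory `HashSet` stands in for the data structure that tracks delete events.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class DeleteFilterSketch {

    /** Hypothetical row key; assumed to uniquely identify one ACID row. */
    static final class RowKey {
        final long originalTxn;
        final int bucket;
        final long rowId;

        RowKey(long originalTxn, int bucket, long rowId) {
            this.originalTxn = originalTxn;
            this.bucket = bucket;
            this.rowId = rowId;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof RowKey)) {
                return false;
            }
            RowKey other = (RowKey) o;
            return originalTxn == other.originalTxn
                && bucket == other.bucket
                && rowId == other.rowId;
        }

        @Override
        public int hashCode() {
            return Objects.hash(originalTxn, bucket, rowId);
        }
    }

    /**
     * Returns the indices of the rows in the batch that have no matching
     * delete event, in their original order. The batch itself is read as a
     * whole and never rebuilt row by row; only an index list is produced.
     */
    static int[] selectLiveRows(RowKey[] batch, Set<RowKey> deleted) {
        int[] selected = new int[batch.length];
        int count = 0;
        for (int i = 0; i < batch.length; i++) {
            if (!deleted.contains(batch[i])) {
                selected[count++] = i;
            }
        }
        return Arrays.copyOf(selected, count);
    }

    public static void main(String[] args) {
        // Delete events, as they might be collected from the delete deltas.
        Set<RowKey> deleted = new HashSet<>();
        deleted.add(new RowKey(5L, 0, 1L));

        // One row batch read directly from a base or delta file.
        RowKey[] batch = {
            new RowKey(5L, 0, 0L),
            new RowKey(5L, 0, 1L),
            new RowKey(5L, 0, 2L)
        };

        int[] live = selectLiveRows(batch, deleted);
        System.out.println(Arrays.toString(live)); // prints [0, 2]
    }
}
```

The point of the sketch is that insert events and delete events never need to be merged: the batch is consumed as read, and the delete lookup only narrows which row indices are passed on.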



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-30 Thread Saket Saurabh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451116#comment-15451116
 ] 

Saket Saurabh commented on HIVE-14233:
--

Thanks [~ekoifman] for committing the patch.



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-30 Thread Saket Saurabh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15451113#comment-15451113
 ] 

Saket Saurabh commented on HIVE-14233:
--

Thanks [~ekoifman] and [~sershe] for reviewing the patch.



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-30 Thread Eugene Koifman (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450819#comment-15450819
 ] 

Eugene Koifman commented on HIVE-14233:
---

+1 patch 12



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450642#comment-15450642
 ] 

Hive QA commented on HIVE-14233:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12826257/HIVE-14233.12.patch

{color:green}SUCCESS:{color} +1 due to 3 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 10502 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.org.apache.hadoop.hive.cli.TestCliDriver
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
org.apache.hadoop.hive.ql.TestMTQueries.testMTQueries1
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1047/testReport
Console output: 
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1047/console
Test logs: 
http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-1047/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12826257 - PreCommit-HIVE-MASTER-Build



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-30 Thread Eugene Koifman (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15450242#comment-15450242
 ] 

Eugene Koifman commented on HIVE-14233:
---

[~saketj], I left one last comment on RB, but it's a nit.

+1 pending tests



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-29 Thread Eugene Koifman (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15447222#comment-15447222
 ] 

Eugene Koifman commented on HIVE-14233:
---

Added some more comments on RB; mostly nits, but not all.



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-29 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446687#comment-15446687
 ] 

Hive QA commented on HIVE-14233:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12826032/HIVE-14233.10.patch

{color:green}SUCCESS:{color} +1 due to 3 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 10496 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.org.apache.hadoop.hive.cli.TestCliDriver
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[vector_join_part_col_char]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_3]
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1033/testReport
Console output: 
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/1033/console
Test logs: 
http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-1033/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12826032 - PreCommit-HIVE-MASTER-Build



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-29 Thread Saket Saurabh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446668#comment-15446668
 ] 

Saket Saurabh commented on HIVE-14233:
--

Oops, forgot to do that. Sure Eugene, done now.



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-29 Thread Eugene Koifman (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446640#comment-15446640
 ] 

Eugene Koifman commented on HIVE-14233:
---

Could you upload the latest patch to RB?



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-24 Thread Saket Saurabh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436236#comment-15436236
 ] 

Saket Saurabh commented on HIVE-14233:
--

Thanks [~ekoifman] for the comments; working on fixing them now.



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-24 Thread Eugene Koifman (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15436113#comment-15436113
 ] 

Eugene Koifman commented on HIVE-14233:
---

[~saketj] more comments on RB



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-22 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15431047#comment-15431047
 ] 

Hive QA commented on HIVE-14233:




Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12824840/HIVE-14233.09.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 8 failed/errored test(s), 10465 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver[acid_mapjoin]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[explainuser_2]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[load_dyn_part1]
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver[transform_ppr1]
org.apache.hive.hcatalog.api.repl.commands.TestCommands.org.apache.hive.hcatalog.api.repl.commands.TestCommands
org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler.org.apache.hive.hcatalog.hbase.TestPigHBaseStorageHandler
org.apache.hive.service.cli.operation.TestOperationLoggingLayout.testSwitchLogLayout
{noformat}

Test results: 
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/952/testReport
Console output: 
https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/952/console
Test logs: 
http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-952/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 8 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12824840 - PreCommit-HIVE-MASTER-Build



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-12 Thread Saket Saurabh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15419809#comment-15419809
 ] 

Saket Saurabh commented on HIVE-14233:
--

Note that this patch does not handle the case where the split is on an 
original file (a file with a non-ACID schema). In such cases, the reader falls 
back to the older strategy of creating vectorized row batches by row-by-row 
stitching. This slow path affects only tables converted from non-ACID to ACID, 
and only until the first major compaction on the table produces a base file.
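
The fallback rule described in this comment can be written as a small Python sketch. This is an illustration only, not code from the patch; `ReadStrategy` and `choose_read_strategy` are hypothetical names, and the boolean flag stands in for however the reader actually detects an original file.

```python
from enum import Enum


class ReadStrategy(Enum):
    VECTORIZED_DIRECT = "read row batches directly from ORC"
    ROW_BY_ROW = "stitch the batch one row at a time"


def choose_read_strategy(split_is_original_file: bool) -> ReadStrategy:
    """Pick the read path for a split (hypothetical helper).

    Original files predate the ACID conversion and lack the ACID schema,
    so they use the older row-by-row path until a major compaction
    rewrites them into a base file.
    """
    if split_is_original_file:
        return ReadStrategy.ROW_BY_ROW
    return ReadStrategy.VECTORIZED_DIRECT
```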



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-09 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414488#comment-15414488
 ] 

Sergey Shelukhin commented on HIVE-14233:
-

Some comments on RB



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-09 Thread Saket Saurabh (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414224#comment-15414224
 ] 

Saket Saurabh commented on HIVE-14233:
--

Thanks [~sershe] for pointing that out. I have attached the Review Board link 
for this JIRA: https://reviews.apache.org/r/50934/



[jira] [Commented] (HIVE-14233) Improve vectorization for ACID by eliminating row-by-row stitching

2016-08-04 Thread Sergey Shelukhin (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15408541#comment-15408541
 ] 

Sergey Shelukhin commented on HIVE-14233:
-

Is it possible to post an RB? An update to the RB (or the initial post, if done 
via the rb tool, I assume) lets you specify a base patch (the HIVE-14035 patch 
in this case, I assume)...
