[jira] [Commented] (HIVE-26150) OrcRawRecordMerger reads each row twice

2022-05-10 Thread Alessandro Solimando (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534265#comment-17534265
 ] 

Alessandro Solimando commented on HIVE-26150:
-

I tried few times to make the issue surface with 
_SortMergedDeleteEventRegistry_ but I haven't managed, sorry!

I have tried to add some updates/inserts in between deletes (to make the 
condition similar to the UTs where the issue appears) but it did not reproduce, 
probably it's not the only condition that is required.

> OrcRawRecordMerger reads each row twice
> ---
>
> Key: HIVE-26150
> URL: https://issues.apache.org/jira/browse/HIVE-26150
> Project: Hive
>  Issue Type: Bug
>  Components: ORC, Transactions
>Affects Versions: 4.0.0-alpha-2
>Reporter: Alessandro Solimando
>Priority: Major
>
> OrcRawRecordMerger reads each row twice, the issue does not surface since the 
> merger is only used with the parameter "collapseEvents" as true, which 
> filters out one of the two rows.
> collapseEvents true and false should produce the same result, since in 
> current acid implementation, each event has a distinct rowid, so two 
> identical rows cannot be there, this is the case only for the bug.
> In order to reproduce the issue, it is sufficient to set the second parameter 
> to false 
> [here|https://github.com/apache/hive/blob/61d4ff2be48b20df9fd24692c372ee9c2606babe/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L2103-L2106],
>  and run tests in TestOrcRawRecordMerger and observe two tests failing:
> {code:bash}
> mvn test -Dtest=TestOrcRawRecordMerger -pl ql
> {code}
> {noformat}
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderNewBaseAndDelta:1332 Found 
> unexpected row: (0,ignore.1)
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderOldBaseAndDelta:1208 Found 
> unexpected row: (0,ignore.1)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26150) OrcRawRecordMerger reads each row twice

2022-04-26 Thread Alessandro Solimando (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528326#comment-17528326
 ] 

Alessandro Solimando commented on HIVE-26150:
-

You are right, only _SortMergedDeleteEventRegistry_ uses _OrcRawRecordMerger_ 
when memory is tight. 

I found some tests covering this:
*[TestVectorizedOrcAcidRowBatchReader.java#L976|https://github.com/apache/hive/blob/a29810ce97a726fc70aecb53ebd648c3237106c4/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedOrcAcidRowBatchReader.java#L976]
*[TestVectorizedOrcAcidRowBatchReader.java#L1113|https://github.com/apache/hive/blob/a29810ce97a726fc70aecb53ebd648c3237106c4/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedOrcAcidRowBatchReader.java#L1113]

The issue does not reproduce there. I noticed a difference w.r.t. the failing 
tests reported in the JIRA description, that is, deletes are interleaved with 
updates in the failing case, while we have insert followed by a bunch of 
deletes in the tests mentioned right above.

I will try to modify the tests to add some updates in between deletes and see 
if I can reproduce that way.

> OrcRawRecordMerger reads each row twice
> ---
>
> Key: HIVE-26150
> URL: https://issues.apache.org/jira/browse/HIVE-26150
> Project: Hive
>  Issue Type: Bug
>  Components: ORC, Transactions
>Affects Versions: 4.0.0-alpha-2
>Reporter: Alessandro Solimando
>Priority: Major
>
> OrcRawRecordMerger reads each row twice, the issue does not surface since the 
> merger is only used with the parameter "collapseEvents" as true, which 
> filters out one of the two rows.
> collapseEvents true and false should produce the same result, since in 
> current acid implementation, each event has a distinct rowid, so two 
> identical rows cannot be there, this is the case only for the bug.
> In order to reproduce the issue, it is sufficient to set the second parameter 
> to false 
> [here|https://github.com/apache/hive/blob/61d4ff2be48b20df9fd24692c372ee9c2606babe/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L2103-L2106],
>  and run tests in TestOrcRawRecordMerger and observe two tests failing:
> {code:bash}
> mvn test -Dtest=TestOrcRawRecordMerger -pl ql
> {code}
> {noformat}
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderNewBaseAndDelta:1332 Found 
> unexpected row: (0,ignore.1)
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderOldBaseAndDelta:1208 Found 
> unexpected row: (0,ignore.1)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26150) OrcRawRecordMerger reads each row twice

2022-04-26 Thread Peter Vary (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527968#comment-17527968
 ] 

Peter Vary commented on HIVE-26150:
---

[~asolimando]: I am trying to find out if fixing this issue would cause 
performance improvements when reading tables where we have delete deltas 
present.

We have two ways to read the delete deltas:
- [SortMergedDeleteEventRegistry 
|https://github.com/apache/hive/blob/a29810ce97a726fc70aecb53ebd648c3237106c4/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1228]
- [ColumnizedDeleteEventRegistry 
|https://github.com/apache/hive/blob/a29810ce97a726fc70aecb53ebd648c3237106c4/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1383]

IIRC the ColumnizedDeleteEventRegistry creates its own readers and 
SortMergedDeleteEventRegistry uses OrcRawRecordMerger, so I would guess that 
the normal reads would be effected with this inefficiency when 
SortMergedDeleteEventRegistry is used, but I would like this to be confirmed.

Thanks,
Peter

> OrcRawRecordMerger reads each row twice
> ---
>
> Key: HIVE-26150
> URL: https://issues.apache.org/jira/browse/HIVE-26150
> Project: Hive
>  Issue Type: Bug
>  Components: ORC, Transactions
>Affects Versions: 4.0.0-alpha-2
>Reporter: Alessandro Solimando
>Priority: Major
>
> OrcRawRecordMerger reads each row twice, the issue does not surface since the 
> merger is only used with the parameter "collapseEvents" as true, which 
> filters out one of the two rows.
> collapseEvents true and false should produce the same result, since in 
> current acid implementation, each event has a distinct rowid, so two 
> identical rows cannot be there, this is the case only for the bug.
> In order to reproduce the issue, it is sufficient to set the second parameter 
> to false 
> [here|https://github.com/apache/hive/blob/61d4ff2be48b20df9fd24692c372ee9c2606babe/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L2103-L2106],
>  and run tests in TestOrcRawRecordMerger and observe two tests failing:
> {code:bash}
> mvn test -Dtest=TestOrcRawRecordMerger -pl ql
> {code}
> {noformat}
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderNewBaseAndDelta:1332 Found 
> unexpected row: (0,ignore.1)
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderOldBaseAndDelta:1208 Found 
> unexpected row: (0,ignore.1)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26150) OrcRawRecordMerger reads each row twice

2022-04-25 Thread Alessandro Solimando (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527591#comment-17527591
 ] 

Alessandro Solimando commented on HIVE-26150:
-

I first discovered the issue while writing a unit test were there were no 
delete records ([this 
test|https://github.com/apache/hive/blob/61d4ff2be48b20df9fd24692c372ee9c2606babe/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcRawRecordMerger.java#L436]).

I then checked some other tests, including those that are failing in the 
description of the JIRA ticket, _testRecordReaderNewBaseAndDelta_ includes some 
delete operations (it creates a _delete_delta_ file), is that what you mean?


> OrcRawRecordMerger reads each row twice
> ---
>
> Key: HIVE-26150
> URL: https://issues.apache.org/jira/browse/HIVE-26150
> Project: Hive
>  Issue Type: Bug
>  Components: ORC, Transactions
>Affects Versions: 4.0.0-alpha-2
>Reporter: Alessandro Solimando
>Priority: Major
>
> OrcRawRecordMerger reads each row twice, the issue does not surface since the 
> merger is only used with the parameter "collapseEvents" as true, which 
> filters out one of the two rows.
> collapseEvents true and false should produce the same result, since in 
> current acid implementation, each event has a distinct rowid, so two 
> identical rows cannot be there, this is the case only for the bug.
> In order to reproduce the issue, it is sufficient to set the second parameter 
> to false 
> [here|https://github.com/apache/hive/blob/61d4ff2be48b20df9fd24692c372ee9c2606babe/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L2103-L2106],
>  and run tests in TestOrcRawRecordMerger and observe two tests failing:
> {code:bash}
> mvn test -Dtest=TestOrcRawRecordMerger -pl ql
> {code}
> {noformat}
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderNewBaseAndDelta:1332 Found 
> unexpected row: (0,ignore.1)
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderOldBaseAndDelta:1208 Found 
> unexpected row: (0,ignore.1)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (HIVE-26150) OrcRawRecordMerger reads each row twice

2022-04-25 Thread Peter Vary (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527423#comment-17527423
 ] 

Peter Vary commented on HIVE-26150:
---

[~asolimando]: Does this happen during normal read of deleted deltas?

> OrcRawRecordMerger reads each row twice
> ---
>
> Key: HIVE-26150
> URL: https://issues.apache.org/jira/browse/HIVE-26150
> Project: Hive
>  Issue Type: Bug
>  Components: ORC, Transactions
>Affects Versions: 4.0.0-alpha-2
>Reporter: Alessandro Solimando
>Priority: Major
>
> OrcRawRecordMerger reads each row twice, the issue does not surface since the 
> merger is only used with the parameter "collapseEvents" as true, which 
> filters out one of the two rows.
> collapseEvents true and false should produce the same result, since in 
> current acid implementation, each event has a distinct rowid, so two 
> identical rows cannot be there, this is the case only for the bug.
> In order to reproduce the issue, it is sufficient to set the second parameter 
> to false 
> [here|https://github.com/apache/hive/blob/61d4ff2be48b20df9fd24692c372ee9c2606babe/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L2103-L2106],
>  and run tests in TestOrcRawRecordMerger and observe two tests failing:
> {code:bash}
> mvn test -Dtest=TestOrcRawRecordMerger -pl ql
> {code}
> {noformat}
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderNewBaseAndDelta:1332 Found 
> unexpected row: (0,ignore.1)
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderOldBaseAndDelta:1208 Found 
> unexpected row: (0,ignore.1)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)