[jira] [Commented] (HIVE-26150) OrcRawRecordMerger reads each row twice
    [ https://issues.apache.org/jira/browse/HIVE-26150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534265#comment-17534265 ]

Alessandro Solimando commented on HIVE-26150:
---------------------------------------------

I tried a few times to make the issue surface with _SortMergedDeleteEventRegistry_ but did not manage to, sorry!

I tried adding some updates/inserts in between deletes (to make the conditions similar to the UTs where the issue appears), but it did not reproduce; that is probably not the only condition required.

> OrcRawRecordMerger reads each row twice
> ---------------------------------------
>
>                 Key: HIVE-26150
>                 URL: https://issues.apache.org/jira/browse/HIVE-26150
>             Project: Hive
>          Issue Type: Bug
>          Components: ORC, Transactions
>    Affects Versions: 4.0.0-alpha-2
>            Reporter: Alessandro Solimando
>            Priority: Major
>
> OrcRawRecordMerger reads each row twice. The issue does not surface because the
> merger is only used with the parameter "collapseEvents" set to true, which
> filters out one of the two rows.
> collapseEvents true and false should produce the same result: in the
> current ACID implementation each event has a distinct rowid, so two
> identical rows cannot exist, and they appear only because of this bug.
> To reproduce the issue, it is sufficient to set the second parameter
> to false
> [here|https://github.com/apache/hive/blob/61d4ff2be48b20df9fd24692c372ee9c2606babe/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L2103-L2106],
> run the tests in TestOrcRawRecordMerger, and observe two tests failing:
> {code:bash}
> mvn test -Dtest=TestOrcRawRecordMerger -pl ql
> {code}
> {noformat}
> [INFO] Results:
> [INFO]
> [ERROR] Failures:
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderNewBaseAndDelta:1332 Found unexpected row: (0,ignore.1)
> [ERROR]   TestOrcRawRecordMerger.testRecordReaderOldBaseAndDelta:1208 Found unexpected row: (0,ignore.1)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
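The description's claim that "collapseEvents" set to true hides a doubled row can be illustrated with a minimal sketch. This is not Hive's OrcRawRecordMerger: the class `CollapseEventsSketch`, the `Event` type, and the `emit` method are hypothetical names, and the row key is simplified to (writeId, bucket, rowId) as an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this is not Hive's OrcRawRecordMerger. */
public class CollapseEventsSketch {

    /** Simplified stand-in for an ACID event: a row key plus a payload (assumed shape). */
    public static final class Event {
        final long writeId;
        final int bucket;
        final long rowId;
        final String value;

        public Event(long writeId, int bucket, long rowId, String value) {
            this.writeId = writeId;
            this.bucket = bucket;
            this.rowId = rowId;
            this.value = value;
        }

        /** Two events refer to the same row when their whole key matches. */
        boolean sameKey(Event other) {
            return writeId == other.writeId
                && bucket == other.bucket
                && rowId == other.rowId;
        }
    }

    /**
     * Emits events from an input that is already sorted by key.
     * With collapse == true, only the first event of each run of equal keys is kept,
     * so an accidentally repeated row is silently dropped.
     */
    public static List<Event> emit(List<Event> sorted, boolean collapse) {
        List<Event> out = new ArrayList<>();
        Event prev = null;
        for (Event e : sorted) {
            if (collapse && prev != null && prev.sameKey(e)) {
                continue; // repeat of the previous key: filtered out
            }
            out.add(e);
            prev = e;
        }
        return out;
    }

    public static void main(String[] args) {
        // The first row appears twice, mimicking a reader that returns the same row twice.
        List<Event> doubled = Arrays.asList(
            new Event(1, 0, 0, "a"),
            new Event(1, 0, 0, "a"),
            new Event(1, 0, 1, "b"));
        System.out.println(emit(doubled, true).size());  // prints 2
        System.out.println(emit(doubled, false).size()); // prints 3
    }
}
```

Under these assumptions, the two settings agree whenever every key is unique, which is why a difference between them points at a row being read twice rather than at legitimate data.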
[jira] [Commented] (HIVE-26150) OrcRawRecordMerger reads each row twice
    [ https://issues.apache.org/jira/browse/HIVE-26150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528326#comment-17528326 ]

Alessandro Solimando commented on HIVE-26150:
---------------------------------------------

You are right, only _SortMergedDeleteEventRegistry_ uses _OrcRawRecordMerger_ when memory is tight.

I found some tests covering this:
* [TestVectorizedOrcAcidRowBatchReader.java#L976|https://github.com/apache/hive/blob/a29810ce97a726fc70aecb53ebd648c3237106c4/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedOrcAcidRowBatchReader.java#L976]
* [TestVectorizedOrcAcidRowBatchReader.java#L1113|https://github.com/apache/hive/blob/a29810ce97a726fc70aecb53ebd648c3237106c4/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestVectorizedOrcAcidRowBatchReader.java#L1113]

The issue does not reproduce there.

I noticed a difference w.r.t. the failing tests reported in the JIRA description: deletes are interleaved with updates in the failing case, while the tests mentioned right above have an insert followed by a bunch of deletes.

I will try to modify the tests to add some updates in between deletes and see if I can reproduce the issue that way.
[jira] [Commented] (HIVE-26150) OrcRawRecordMerger reads each row twice
    [ https://issues.apache.org/jira/browse/HIVE-26150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527968#comment-17527968 ]

Peter Vary commented on HIVE-26150:
-----------------------------------

[~asolimando]: I am trying to find out whether fixing this issue would improve performance when reading tables that have delete deltas.

We have two ways to read the delete deltas:
- [SortMergedDeleteEventRegistry|https://github.com/apache/hive/blob/a29810ce97a726fc70aecb53ebd648c3237106c4/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1228]
- [ColumnizedDeleteEventRegistry|https://github.com/apache/hive/blob/a29810ce97a726fc70aecb53ebd648c3237106c4/ql/src/java/org/apache/hadoop/hive/ql/io/orc/VectorizedOrcAcidRowBatchReader.java#L1383]

IIRC the ColumnizedDeleteEventRegistry creates its own readers and SortMergedDeleteEventRegistry uses OrcRawRecordMerger, so I would guess that normal reads are affected by this inefficiency when SortMergedDeleteEventRegistry is used, but I would like this to be confirmed.

Thanks,
Peter
[jira] [Commented] (HIVE-26150) OrcRawRecordMerger reads each row twice
    [ https://issues.apache.org/jira/browse/HIVE-26150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527591#comment-17527591 ]

Alessandro Solimando commented on HIVE-26150:
---------------------------------------------

I first discovered the issue while writing a unit test where there were no delete records ([this test|https://github.com/apache/hive/blob/61d4ff2be48b20df9fd24692c372ee9c2606babe/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcRawRecordMerger.java#L436]).

I then checked some other tests, including those that fail in the description of the JIRA ticket. _testRecordReaderNewBaseAndDelta_ includes some delete operations (it creates a _delete_delta_ file); is that what you mean?
[jira] [Commented] (HIVE-26150) OrcRawRecordMerger reads each row twice
    [ https://issues.apache.org/jira/browse/HIVE-26150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17527423#comment-17527423 ]

Peter Vary commented on HIVE-26150:
-----------------------------------

[~asolimando]: Does this happen during a normal read of delete deltas?