[ https://issues.apache.org/jira/browse/HIVE-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Koifman updated HIVE-16812:
----------------------------------
    Issue Type: Sub-task  (was: Improvement)
        Parent: HIVE-20738

> VectorizedOrcAcidRowBatchReader doesn't filter delete events
> ------------------------------------------------------------
>
>                 Key: HIVE-16812
>                 URL: https://issues.apache.org/jira/browse/HIVE-16812
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Transactions
>    Affects Versions: 2.3.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>            Priority: Critical
>             Fix For: 4.0.0
>
>         Attachments: HIVE-16812.02.patch, HIVE-16812.04.patch, HIVE-16812.05.patch, HIVE-16812.06.patch, HIVE-16812.07.patch
>
>
> The constructor of VectorizedOrcAcidRowBatchReader has
> {noformat}
> // Clone readerOptions for deleteEvents.
> Reader.Options deleteEventReaderOptions = readerOptions.clone();
> // Set the range on the deleteEventReaderOptions to 0 to INTEGER_MAX because
> // we always want to read all the delete delta files.
> deleteEventReaderOptions.range(0, Long.MAX_VALUE);
> {noformat}
> This is suboptimal, since base and deltas are sorted by ROW__ID. For each split of base we can find the min/max ROW__ID and load only the events from the delete deltas that fall in the [min,max] range. This reduces the number of delete events we load in memory (to no more than the number of rows in the split).
> When we support sorting on PK, the same should apply, but we'd need to make sure to store PKs in the ORC index.
> See {{OrcRawRecordMerger.discoverKeyBounds()}}
> {{hive.acid.key.index}} in the ORC footer has an index of ROW__IDs, so we should know min/max easily for any file written by {{OrcRecordUpdater}}
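
The range-filtering idea in the description can be illustrated with a minimal, self-contained sketch. It is not taken from the attached patches and uses no Hive classes: `DeleteEventRangeFilter`, `RowKey`, and `filter` are hypothetical names, and the (writeId, bucket, rowId) ordering is an assumed simplification of how ROW__ID sorts. Given delete events already sorted by key and the min/max key of a split, it keeps only the events inside the inclusive [min,max] range.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these types exist in Hive.
final class DeleteEventRangeFilter {

  /** Simplified stand-in for ROW__ID, assumed to sort by (writeId, bucket, rowId). */
  static final class RowKey implements Comparable<RowKey> {
    final long writeId;
    final int bucket;
    final long rowId;

    RowKey(long writeId, int bucket, long rowId) {
      this.writeId = writeId;
      this.bucket = bucket;
      this.rowId = rowId;
    }

    @Override
    public int compareTo(RowKey other) {
      int c = Long.compare(writeId, other.writeId);
      if (c != 0) {
        return c;
      }
      c = Integer.compare(bucket, other.bucket);
      if (c != 0) {
        return c;
      }
      return Long.compare(rowId, other.rowId);
    }
  }

  /**
   * Keeps only the delete events whose key lies in the inclusive range
   * [min, max]. The input must already be sorted by key.
   */
  static List<RowKey> filter(List<RowKey> sortedDeleteEvents, RowKey min, RowKey max) {
    List<RowKey> kept = new ArrayList<>();
    for (RowKey key : sortedDeleteEvents) {
      if (key.compareTo(min) < 0) {
        continue; // event precedes the split's first row
      }
      if (key.compareTo(max) > 0) {
        break; // input is sorted, so no later event can be in range
      }
      kept.add(key);
    }
    return kept;
  }
}
```

Because every kept key lies between the split's min and max, the retained events cannot outnumber the keys in that range, which is the memory bound the description aims for. A real reader would presumably skip out-of-range data using the ORC index rather than scan every event as this sketch does.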