[ https://issues.apache.org/jira/browse/HIVE-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Koifman updated HIVE-17915:
----------------------------------
    Description:
Since HIVE-12631, LLAP IO can support Acid tables, but not when reading "original" files.
HIVE-17458 enables VectorizedOrcAcidRowBatchReader to vectorize reads over "original" files, but not with LLAP IO.
The current implementation of _OrcSplit.canUseLlapIo()_ is the same as in HIVE-12631.
This can/should be improved. There are 2 parts to this:

When a read of an "original" file is performed such that the data doesn't need to be decorated with ROW__ID (see _VectorizedOrcAcidRowBatchReader.canUseLlapForAcid()_), then VectorizedOrcAcidRowBatchReader as of HIVE-17458 should be usable with LLAP IO, but when I tried it I got _ArrayIndexOutOfBoundsException_ in various places of the stack.
This is the more important one.

The 2nd issue is that reading "original" acid files (when ROW__IDs are needed) requires using _org.apache.hadoop.hive.ql.io.orc.RecordReader.getRowNumber()_ in _VectorizedOrcAcidRowBatchReader_.
This API is not available on the reader that _LlapRecordReader_ provides.
It would be better if getRowNumber() were available, both for performance and for simpler logic in the code.

cc [~sershe], [~teddy.choi]

  was:
Reading "original" acid files requires using _org.apache.hadoop.hive.ql.io.orc.RecordReader.getRowNumber()_ in _VectorizedOrcAcidRowBatchReader_.
This API is not available on the reader that _LlapRecordReader_ provides, so _VectorizedOrcAcidRowBatchReader.canUseLlapForAcid()_ is used to disable LLAP IO in some corner cases.
It would be better if getRowNumber() were available, both for performance and for simpler logic in the code.

This needs HIVE-17458 to be committed to make sense.
cc [~sershe], [~teddy.choi]

> Enable VectorizedOrcAcidRowBatchReader to be used with LLAP IO elevator over original acid files
> ------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-17915
>                 URL: https://issues.apache.org/jira/browse/HIVE-17915
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Transactions
>    Affects Versions: 3.0.0
>            Reporter: Eugene Koifman
>            Priority: Minor
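
To make the two parts of the description concrete, here is a minimal Java sketch of the intended behaviour. It is not Hive code: the class `LlapAcidReadSketch` and both method names and signatures are hypothetical. It also makes two simplifying assumptions: that files already in acid format are always eligible for LLAP IO, and that the row component of ROW__ID for an "original" file is simply the row's position in the file, which is why the description asks for something like getRowNumber() on the LLAP-provided reader.

```java
// Hypothetical sketch, not Hive code. All names and signatures are illustrative.
final class LlapAcidReadSketch {

    /**
     * Part 1: the decision the description asks for.
     * Assumption: acid-format (non-"original") files are always eligible.
     */
    static boolean canUseLlapIo(boolean isOriginalFile, boolean needsRowId) {
        if (!isOriginalFile) {
            return true;
        }
        // "Original" files: eligible only when ROW__ID is not required,
        // because the LLAP-provided reader does not expose the row number.
        return !needsRowId;
    }

    /**
     * Part 2: why the row number matters.
     * Assumption: the row component of ROW__ID for an "original" file is the
     * row's position, so a batch needs the number of its first row.
     */
    static long[] rowIdsForBatch(long firstRowNumber, int batchSize) {
        long[] rowIds = new long[batchSize];
        for (int i = 0; i < batchSize; i++) {
            rowIds[i] = firstRowNumber + i;
        }
        return rowIds;
    }

    public static void main(String[] args) {
        System.out.println(canUseLlapIo(true, false));
        System.out.println(canUseLlapIo(true, true));
        System.out.println(java.util.Arrays.toString(rowIdsForBatch(1024L, 3)));
    }
}
```

If the LLAP reader exposed the row number of the first row in each batch, the second case (`isOriginalFile` and `needsRowId` both true) would no longer have to fall back to the non-LLAP path.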