[ 
https://issues.apache.org/jira/browse/HIVE-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eugene Koifman updated HIVE-17915:
----------------------------------
    Description: 
Since HIVE-12631, LLAP IO supports Acid tables, but not when reading
"original" files.
HIVE-17458 enables VectorizedOrcAcidRowBatchReader to vectorize reads over 
"original" files, but not with LLAP IO.

The current implementation of _OrcSplit.canUseLlapIo()_ is unchanged since 
HIVE-12631.
This can and should be improved.  There are 2 parts to this:

The 1st issue: when a read of an "original" file does not need its data 
decorated with ROW__ID (see 
_VectorizedOrcAcidRowBatchReader.canUseLlapForAcid()_), 
VectorizedOrcAcidRowBatchReader as of HIVE-17458 should be usable with LLAP IO. 
When I tried it, however, I got _ArrayIndexOutOfBoundsException_ in various 
places of the stack.
This is the more important of the two issues.
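
As a rough illustration of the rule this first part is aiming for (not the actual _OrcSplit.canUseLlapIo()_ code), the check for an acid read could be modeled as below. The class name, method signature and boolean inputs are hypothetical stand-ins for what Hive derives from the split and the query.

```java
/**
 * Hypothetical, simplified model of the intended rule for acid-table reads.
 * This is NOT Hive's OrcSplit implementation; all names here are assumptions.
 */
final class LlapIoEligibilitySketch {

    /**
     * @param isOriginal true if the split is over an "original" (pre-acid layout) file
     * @param needsRowId true if the query needs rows decorated with ROW__ID
     * @return whether LLAP IO could serve this acid read under the proposed rule
     */
    public static boolean canUseLlapIo(boolean isOriginal, boolean needsRowId) {
        if (!isOriginal) {
            // Acid-format files: LLAP IO is already supported since HIVE-12631.
            return true;
        }
        // "Original" files: usable only when no ROW__ID has to be synthesized,
        // because that path depends on getRowNumber() (the 2nd issue).
        return !needsRowId;
    }
}
```

Under this model, an "original" file that needs ROW__ID is the only case that still falls back to the non-LLAP reader.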


The 2nd issue is that reading "original" acid files (when ROW__IDs are needed) 
requires using _org.apache.hadoop.hive.ql.io.orc.RecordReader.getRowNumber()_ 
in _VectorizedOrcAcidRowBatchReader_.
This API is not available on the reader that _LlapRecordReader_ provides.

It would be better if getRowNumber() were available, both for performance and 
for simpler logic in the code.
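
To show why a row-number API matters here, the sketch below assigns consecutive row ids to a batch from the reader's current position. It is only an illustration: the _RowNumberSource_ interface, the _rowIdOffset_ parameter and the method name are hypothetical, and it assumes getRowNumber() reports the position of the next row to be returned.

```java
/**
 * Hypothetical sketch of deriving row ids for a batch read from an
 * "original" file. Not Hive code; all names here are assumptions.
 */
final class SyntheticRowIdSketch {

    /** Stand-in for a reader that can report the position of the next row. */
    public interface RowNumberSource {
        long getRowNumber();
    }

    /**
     * Returns batchSize consecutive ids, starting at the reader's current
     * row number plus an offset for rows in preceding files.
     */
    public static long[] rowIdsForBatch(RowNumberSource reader, long rowIdOffset, int batchSize) {
        // Must be captured before the batch is read, while it still points at the first row.
        long first = rowIdOffset + reader.getRowNumber();
        long[] rowIds = new long[batchSize];
        for (int i = 0; i < batchSize; i++) {
            rowIds[i] = first + i;
        }
        return rowIds;
    }
}
```

Without such a call on the reader, the caller has to track positions itself, which is the extra logic the description would like to avoid.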


cc [~sershe], [~teddy.choi]

  was:
Reading "original" acid files requires using 
_org.apache.hadoop.hive.ql.io.orc.RecordReader.getRowNumber()_ in 
_VectorizedOrcAcidRowBatchReader_.
This API is not available on the reader that _LlapRecordReader_ provides, so 
_VectorizedOrcAcidRowBatchReader.canUseLlapForAcid()_ is used to disable LLAP 
IO in some corner cases.

It would be better if getRowNumber() was available for performance as well as 
simpler logic in the code.

This needs HIVE-17458 to be committed to make sense.

cc [~sershe], [~teddy.choi]


> Enable VectorizedOrcAcidRowBatchReader to be used with LLAP IO elevator over 
> original acid files
> ------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-17915
>                 URL: https://issues.apache.org/jira/browse/HIVE-17915
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Transactions
>    Affects Versions: 3.0.0
>            Reporter: Eugene Koifman
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
